
Link Analysis: An Information Science Approach


Recent and Forthcoming Volumes

Leo Egghe, Power Laws in the Information Production Process: Lotkaian Informetrics

Donald Case, Looking for Information

Matthew Locke Saxton and John V. Richardson, Understanding Reference Transactions: Turning Art Into a Science

Robert M. Hayes, Models for Library Management, Decision-Making, and Planning

Charles T. Meadow, Bert R. Boyce, and Donald H. Kraft, Text Information Retrieval Systems, Second Edition

Charles T. Meadow, Text Information Retrieval Systems

A.J. Meadows, Communicating Research

V. Frants, J. Shapiro, & V. Votskunskii, Automated Information Retrieval: Theory and Methods

Harold Sackman, Biomedical Information Technology: Global Social Responsibilities for the Democratic Age

Peter Clayton, Implementation of Organizational Innovation: Studies of Academic and Research Libraries

Bryce L. Allen, Information Tasks: Toward a User-Centered Approach to Information Systems

Library and Information Science

Series Editor: Bert R. Boyce, School of Library & Information Science, Louisiana State University, Baton Rouge


Mike Thelwall

2004

ELSEVIER ACADEMIC PRESS

Amsterdam - Boston - Heidelberg - London - New York - Oxford - Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo

Link Analysis: An Information Science Approach


ELSEVIER B.V., Radarweg 29, P.O. Box 211, 1000 AE Amsterdam, The Netherlands
ELSEVIER Inc., 525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
ELSEVIER Ltd., The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
ELSEVIER Ltd., 84 Theobalds Road, London WC1X 8RR, UK

© 2004 Elsevier Inc. All rights reserved.

This work is protected under copyright by Elsevier Inc., and the following terms and conditions apply to its use:

Photocopying

Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use.

Permissions may be sought directly from Elsevier's Rights Department in Oxford, UK: phone (+44) 1865 843830, fax (+44) 1865 853333, email: [email protected]. Requests may also be completed on-line via the Elsevier homepage (http://www.elsevier.com/locate/permissions).

In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 20 7631 5555; fax: (+44) 20 7631 5500. Other countries may have a local reprographic rights agency for payments.

Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of the Publisher is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.

Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter.

Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier's Rights Department, at the fax and e-mail addresses noted above.

Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

First edition 2004

ISBN: 0-12-088553-0

The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).

Printed in The Netherlands.


Link Analysis: An Information Science Approach

Part I: Theory

1. Introduction
   Objectives; Link analysis; Historical overview; What is the information science approach to link analysis?; Contents and structure; Key terminology; Summary; Further reading; References

2. Web crawlers and search engines
   Objectives; Introduction; Web crawlers (Finding pages, Content crawling vs. URL crawling, Content crawling, Obscured links, Depth and other arbitrary limitations, Automatically generated pages, Ethical issues and robots.txt, The web page, Web crawling summary); Search engines (Known biases, Search engine ranking); The Internet Archive; Summary; Further reading; References

3. The theoretical perspective for link counting
   Objectives; Introduction; The theoretical perspective for link counting; Anomalies; Manual filtering and banned lists; Alternative Document Models (Web sites and web documents, ADMs and standard ADM counting, ADM range counting models); Choosing link counting strategies; Summary; Further reading; References

4. Interpreting link counts: Random samples and correlations
   Objectives; Introduction; Interpreting link counts; The pilot feasibility and validity study; Full-scale random sampling; Confidence limits for categories; Correlation testing; Literature review; Summary; Further reading; References

Part II: Web structure

5. Link structures in the web graph
   Objectives; Introduction; Power laws in the web; Models of web growth; Link topologies; Power laws and link topologies in academic webs; Summary; Further reading; References

6. The content structure of the web
   Objectives; Introduction; The topic structure of the web; A link-content web growth model; Link text; The subject structure of academic webs; Colinks; Summary; Further reading; References

Part III: Academic links

7. Universities: Link types
   Objectives; Introduction; Citation analysis; The role of a university web site; National systems of university web sites; Page types; Link types; Summary; Further reading; References

8. Universities: Link models
   Objectives; Introduction; The relationship between inlinks and research; Academic linking: Quality vs. quantity; Alternative logical linking models; Mathematical models; The influence of geography; Regional groupings; Summary; References

9. Universities: International links
   Objectives; Introduction; National vs. international links; International linking comparisons; Linguistic influences; Summary; Further reading; References

10. Departments and subjects
   Objectives; Introduction; Departmental web sites; Disciplinary differences in link types; Issues of scale and correlation tests (Country, Subject, Outcome); Geographic and international factors; Summary; Further reading; References

11. Journals and articles
   Objectives; Introduction; Journal Impact Factors; Journal web sites; Journal web site inlinks: Issues; Journal web site inlinks: Case study; Types of links in journal articles; Digital library links; Combined link and log file analysis; Related research topics; Summary; Further reading; References

Part IV: Applications

12. Search engines and web design
   Objectives; Introduction; Link structures and crawler coverage; Text in web sites and the Vector Space Model; The PageRank algorithm; Case study: PageRank calculations for a gateway site; HITS; HITS worked example; Summary: Web site design for PageRank and HITS; Further reading; Appendix: the Vector Space Model; References

13. A health check for Spanish universities
   Objective; Introduction; Research questions; Methods; Results and discussion; Conclusion; References

14. Personal web pages linking to universities
   Objectives; Introduction; Web publishing and personal home pages; Research questions; Methods (Data collection, Data analysis); Results (ISP bias test, ADM fitting, Correlations between links and research ratings, A comparison of university and home page link sources, Individual page categorizations); Conclusion; Meta-conclusions; Acknowledgement; References

15. Academic networks
   Objectives; Introduction; Methods; University sitemaps; National academic web maps; Subject maps; Summary; Further reading; References

16. Business web sites
   Objectives; Introduction; Site coverage checks; Site indexing and ranking checks; Competitive intelligence; Case study (Center Parcs, Hoseasons, Butlins, Pontins, Haven Holidays, General queries); Summary; Further reading; References

Part V: Tools and techniques

17. Using commercial search engines and the Internet Archive
   Objectives; Introduction; Checking results; Dealing with variations in results; Using multiple search engines; Using the Internet Archive; Summary; Online resources; Further reading; References

18. Personal crawlers
   Objectives; Introduction; Types of personal crawler; SocSciBot (Web page retrieved, Web page qualification, Web link extraction, URLs from HTTP, Obscured or unspecified URLs, Server-generated pages, Dealing with errors, Human intervention during crawls); SocSciBot tools; Summary; Online resources; Further reading; References

19. Data cleansing
   Objectives; Introduction; Overview of data cleansing techniques; Anomaly identification; TLD Spectral Analysis; Summary; Online resources; References

20. Online university link databases
   Objective; Introduction; Overview of the link databases; Link structure files; The banned lists; Analyzing the data; Other link structure databases; Summary; Online resources; Further reading; Reference

21. Embedded link analysis methodologies
   Objectives; Introduction; Web Sphere Analysis; Virtual ethnography; Summary; Further reading; References

22. Social Network Analysis
   Objectives; Introduction; Some SNA metrics; Software; Summary; Further reading; References

23. Network visualizations
   Objectives; Introduction; Network diagrams; Large network diagrams; MultiDimensional Scaling; Self-Organizing Maps; Knowledge Domain Visualisation; Summary; Online resources; References

24. Academic link indicators
   Objective; Introduction; Web indicators as process indicators; Issues of size and reliability; Benchmarking indicators; Link metrics; Relational indicators; Other metrics; Summary; Further reading; References

Part VI: Summary

25. Summary
   Objectives; Introduction; Information science contributions to link analysis; Other link analysis approaches; Future directions

26. Glossary

References

Appendix: A SocSciBot tutorial

   Tutorial
      Step 1: Installing SocSciBot, SocSciBot Tools and Cyclist
      Step 2: Installing Pajek
      Step 3: Crawling a first site with SocSciBot
      Step 4: Crawling two more sites with SocSciBot
      Step 5: Viewing basic reports about the "small test" project with SocSciBot Tools
      Step 6: Viewing a network diagram with Pajek
      Step 7: Viewing site diagrams with Pajek
      Step 8: Using Cyclist
   Summary

Index


PART I: THEORY

INTRODUCTION

OBJECTIVES

• To introduce the content and structure of the book and some key terminology.
• To outline the information science approach to link analysis.

LINK ANALYSIS

Link analysis is performed in very diverse subjects, from computer science and theoretical physics to information science, communication studies and sociology. This is a testament both to the importance of the web and to a widespread belief that hyperlinks between web pages can yield useful information of one kind or another. This belief probably stems from several related factors: the success of Google, which uses a link-based algorithm for identifying the best pages; analogies with other phenomena, such as journal citations and social connections; and probably also links being 'in your face' all the time, whether using the web for research, business or recreation.

In this book, an information science approach to link analysis is set out with the principal aim of introducing it to a new audience. This new audience will then be able to critically evaluate existing research and develop their own research projects and methods. It is a central belief of this book that the information science approach is widely useful to other researchers, particularly social scientists interested in analyzing phenomena with an online component. No attempt is made to give comprehensive coverage of all different types of link analysis: such an enterprise would fail between the detail of the mathematics used in some areas and the qualitative approach used in others. The information science theme of the book has resulted in at least half of its content being related to the study of academic web use or scholarly communication. Readers may therefore also gain additional insights into scholarly communication.

The book seeks to answer four main questions.
• Which kinds of information can be extracted by analyzing the hyperlinks between a set of web pages or sites?
• Which techniques should be used?
• What are the likely pitfalls of link analysis?
• How can and should a link analysis be conducted in practice?

HISTORICAL OVERVIEW

The start of published web link analysis research appears to date from 1995-1996, occurring simultaneously in several disciplines, including computer science for search engine development (e.g., Weiss, Velez, Sheldon et al., 1996), and mathematics for structure and complexity analysis (e.g., Abraham, 1996). The first information scientist to publish a discussion of the potential for transferring information science techniques to the Internet appears to be the Brazilian Marcia J. Bossy (1995), with an article in a French online journal. The first published information science link analysis seems to be that of Larson (1996). His "Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace" presentation at the American Society for Information Science conference explicitly adapted existing information science techniques from bibliometrics to the web. Larson's objective was to assess the link structure of a topic on the web (Earth Sciences) and the characteristics of highly linked-to documents.

Shortly following Larson's presentation, a number of other information scientists also realized that advanced features of search engines could be used for an information science-style link analysis. This produced Rousseau's (1997) informetric analysis of the web and Rodriguez i Gairin's (1997) web citation analysis, the latter describing the search engine AltaVista as the web's 'citation index'.

Two other important developments occurred in parallel with the genesis of link analysis: the foundation of a journal and the development of a theoretical orientation for information science web research. Almind and Ingwersen (1997) coined the term 'webometrics' for the quantitative analysis of web-related phenomena from an information science perspective. Most webometrics research has, so far, focused on hyperlinks, although there have also been quantitative analyses of search engine results and longitudinal investigations into web page changes. The term 'cybermetrics' emerged at the same time as webometrics and is almost synonymous: the difference being that cybermetrics includes quantitative analysis of the Internet, not just the web. A key instigator of this term was Isidro Aguillo, who founded the e-journal Cybermetrics in 1997.

Since 1997, there have been a large number of link analysis studies taking an information science approach (Thelwall, Vaughan & Bjorneborn, 2005). These have collectively produced the developed body of theory and methods that is summarized in this book.

WHAT IS THE INFORMATION SCIENCE APPROACH TO LINK ANALYSIS?

The information science approach to link analysis is to adopt and adapt existing information science techniques for the meta-analysis of documents through investigating inter-document connections. This set of existing techniques is part of two overlapping fields of study: bibliometrics, the quantitative analysis of documents; and scientometrics, the quantitative analysis of science and its outputs. Within the overlap of these two fields a number of techniques for analyzing scientific publications have been developed, principally for journal articles and patents, and using citations as the key inter-document connectors. The surface similarity between hyperlinks and citations is that they are both directional links between documents, often documents created by different authors. There is an extensive body of research and theory concerning citations (e.g., Borgman & Furner, 2002) that serves as a starting point for an information science approach to link analysis. There is a historical parallel: citation analysis techniques have been adapted from their original information science home of journal citations to patent citations (Oppenheim, 2000), in response to the increasing commercialization of research.

An information science approach to link analysis
1) Formulate an appropriate research question, taking into account existing knowledge of web structure (>chapters 5, 6, and chapters 7-16 as appropriate).
2) Conduct a pilot study (>chapter 4).
3) Identify web pages or sites that are appropriate to address a research question.
4) Collect link data from a commercial search engine or a personal crawler, taking appropriate safeguards to ensure that the results obtained are accurate (>chapter 17 or 18).
5) Apply data cleansing techniques to the links, if possible, and select an appropriate counting method (>chapters 3 and 19).
6) Partially validate the link count results through correlation tests (>chapter 4).
7) Partially validate the interpretation of the results through a link classification exercise (>chapter 4).
8) Report results with an interpretation consistent with the link classification exercise, including either a detailed description of the classification or exemplars to illustrate the categories (>chapter 4).
9) Report the limitations of the study and parameters used in data collection and processing (stages 3 to 5) (>chapters 3, 4).

The information science approach to link analysis is outlined in the box above. Those familiar with citation analysis will see strong parallels, but these are not directly commented upon. There are two central themes, the first being information. The objective of the link analysis is to deliver useful information. Other types of link analysis may have different objectives, such as identifying abstract mathematical patterns or improving the performance of web information retrieval algorithms. In contrast, humans are the end users for the information science approach and the information delivered to them typically relates to the contents of the web pages or their authors/owners.

The second information science theme is methodological soundness, particularly validity and reliability of results. This is again in contrast to other applications, for which validity and reliability are not essential. For example, commercial search engines exploiting link analysis only need it to deliver an overall improvement in their service to users and not to satisfy any information-centered research criteria.

Note that the stages in the box are not present in all of the research discussed in this book, particularly the non-information science link analysis.

CONTENTS AND STRUCTURE

This book is a hybrid creation: partly online and partly offline; partly text and partly software and data; partly free and partly for sale. The boundaries are blurred so that some book users will not realize that the print book exists at all. The contents are as follows.

Text: an information science theory of information science link analysis, supported by results and theory from other fields, case studies and overviews of specific link analysis methods. This is the conventional book. The print part of the book is split into six parts.

Part I Theory: introduces the theory of information science link analysis, including basic methods.
Part II Web structure background: surveys research from other subject areas that give useful background information to help interpret the results of link analysis investigations and, equally importantly, to build intuition about how links are used on the web.
Part III Academic links: focuses on academic link analysis. This has two purposes. The first is the topic itself: to give a comprehensive survey of state-of-the-art (2004) research into how academic-related links can be used and interpreted. The second purpose is to illustrate and describe in detail the methods of information science link analysis. A central part of this is a discussion of how useful information can be extracted from link counts.
Part IV Applications: presents a series of complete link analysis case studies. These are intended to illustrate a range of different applications and also the finer details of individual research projects. Part IV may be skimmed or read selectively.
Part V Tools and techniques: describes methods and software tools that are useful in link analysis. Detailed instructions for various tools are given online, whereas the chapters give a more general description of their link analysis capabilities. Part V is aimed at those intending to conduct their own link analysis research.
Part VI Summary: summarizes the key components of the information science approach to link analysis.

Online text: up-to-date instructions on using search engines and different types of software for link analysis. This part of the book is kept online so that it can be updated as search engines and other software evolve and emerge. This allows the conventional book to be relatively free of material that will date quickly. This is the now conventional 'web site supporting the book' and is free.

Online link analysis software: a web crawler, SocSciBot, and a suite of link analysis programs, SocSciBot Tools. This allows more scientific studies than achievable with commercial search engines and also makes it feasible to apply all of the techniques in this book without the need to write new computer programs. This is the conventional 'software that implements the techniques described in the book' and is free (see chapters 16 and 19 for more information).

Online link databases: large files of the link structures of many universities, collected since 2000 by an information science web crawler, a variant of SocSciBot. These link databases allow anyone to conduct large-scale link analyses without needing to spend the time crawling many large sites. This collection of link files predates the idea of the book and is therefore only loosely part of it, and is free (see chapter 19 for more information).

The combination of resources forming this book has the objective of making it as easy as possible for readers to conduct their own link analysis investigations.

KEY TERMINOLOGY

The following words are used repeatedly.

• Inlink: a link to a web page. If qualified by a web unit, this implies that the link should originate outside of the specified unit. For example a site inlink is a link to any page in a site from any page in a different site. Similarly, a page inlink is a link to a page from a different page. Inlink is synonymous with 'backlink' and inlinked is synonymous with 'linked to'.

• Outlink: a link from a web page. If qualified by a web unit, this implies that the link should target a page outside of the specified unit. For example a site outlink is a link from any page in a site to any page in a different site. Similarly, a page outlink is a link from a page to a different page.

• Selflink: a link from a web page to the same page, perhaps to a different part of the page. If qualified by a web unit, this implies that the link should target a page inside of the specified unit. For example a site selflink is a link from any page in a site to any page in the same site. Site selflink is synonymous with 'internal site link', or sometimes just 'internal link'.

• Interlink: normally a link between two different web sites, also referred to as an inter-site link. This is commonly used with the -ing form of the word. For example, web site interlinking refers to links between web sites (i.e., site inlinks/site outlinks).

• Link, hyperlink: both refer to a web link. These terms are used when there is no need to distinguish between inlinks and outlinks. They are also occasionally used to refer to inlinks and outlinks, where the context is clear, to give some variation in the text.

• Co-linked: when two pages both have inlinks from a third page. In Figure 1.1, B and C are co-linked by A.

• Co-linking: when two pages both have outlinks to a third page. Sometimes also described as bibliometric coupling or just coupling.

• Web site: a self-contained collection of one or more pages with a consistent theme. In line with standard use, the definition is intentionally loose and allows different web sites to overlap. Hence the web site of an academic may be within the web site of a department, within the web site of a university.

Note that the definitions of inlink and outlink above are perspective-driven: every inlink is an outlink from the perspective of the source page, and vice versa. This is illustrated in the Bjorneborn diagram (Bjorneborn, 2004; Bjorneborn & Ingwersen, 2005) in Figure 1.1. The link y is an outlink from page A, but an inlink to page B. It is also a site selflink within www.albany.edu, but x is a site outlink from www.albany.edu and a site inlink to www.mit.edu.

Figure 1.1. Links between three pages in two web sites (www.albany.edu and www.mit.edu).
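The perspective-driven terminology can be made concrete with a small illustration. The sketch below is not from the book: it classifies a link as a page selflink, site selflink or site outlink (which is, from the target's perspective, a site inlink), and it approximates a 'site' by the host name, which is an assumption made purely for illustration since the book's definition of a web site is deliberately looser.

from urllib.parse import urlparse

def classify_link(source_url: str, target_url: str) -> str:
    """Classify a link as a page selflink, site selflink or site outlink.
    The 'site' is approximated here by the host name (an illustrative assumption)."""
    src, tgt = urlparse(source_url), urlparse(target_url)
    if (src.netloc, src.path) == (tgt.netloc, tgt.path):
        return "page selflink"   # same page, perhaps a different fragment
    if src.netloc == tgt.netloc:
        return "site selflink"   # an internal site link
    return "site outlink"        # and a site inlink from the target's perspective

# Link y in Figure 1.1 stays within www.albany.edu, so it is a site selflink.
print(classify_link("http://www.albany.edu/a.html", "http://www.albany.edu/b.html"))
# Link x goes from www.albany.edu to www.mit.edu, so it is a site outlink.
print(classify_link("http://www.albany.edu/b.html", "http://www.mit.edu/"))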

SUMMARY

This book is a hybrid online-offline entity designed to introduce the information science approach to link analysis to a new audience and to make it accessible. In its online component (http://linkanalysis.wlv.ac.uk/) it includes a large repository of data as well as tools for collecting and processing link data. The information science approach sketched in this chapter is elaborated throughout.

FURTHER READING

For a developed theoretical framework for webometrics and extra terminology for link analysis, see Bjorneborn & Ingwersen (2005). For general surveys of web research including or highlighting link analysis there are a few review articles (Park & Thelwall, 2003; Thelwall, Vaughan & Bjorneborn, 2005; Li, 2003; Wilkinson, Thelwall & Li, 2003). A series of critical evaluations of quantitative web approaches have been published, and are useful sources of perspective and caution (Egghe, 2000; van Raan, 2001; Bjorneborn & Ingwersen, 2001).

For background information on bibliometrics, see a 2002 review chapter (Borgman & Furner, 2002) and see also Cronin's (2001) discussion of the potential for the expansion of bibliometrics to the web. For a deeper general methodological background, Tashakkori and Teddlie (1998) is a good book, and Oppenheim's (2000) chapter on transferring citation analysis to patents is well worth reading.

REFERENCES

Abraham, R.H. (1996). Webometry: measuring the complexity of the World Wide Web. Visual Math Institute, University of California at Santa Cruz. Available: http://www.ralph-abraham.org/vita/redwood/vienna.html

Almind, T.C. & Ingwersen, P. (1997). Informetric analyses on the world wide web: Methodological approaches to "webometrics". Journal of Documentation, 53(4), 404-426.

Bjorneborn, L. (2004). Small-world link structures across an academic web space: a library and information science approach. PhD Thesis. Royal School of Library and Information Science, Copenhagen, Denmark.

Bjorneborn, L. & Ingwersen, P. (2001). Perspectives of webometrics. Scientometrics, 50(1), 65-82.

Bjorneborn, L. & Ingwersen, P. (2005, to appear). Towards a basic framework for webometrics. Journal of the American Society for Information Science and Technology, special issue on webometrics.

Borgman, C. & Furner, J. (2002). Scholarly communication and bibliometrics. In: Cronin, B. (ed.), Annual Review of Information Science and Technology 36, Medford, NJ: Information Today Inc., pp. 3-72.

Bossy, M.J. (1995). The last of the litter: "Netometrics". In: Les Sciences de l'information: bibliometrie, scientometrie, infometrie. Presses Universitaires de Rennes. Also: Solaris, 2. Available: http://biblio-fr.info.unicaen.fr/bnum/jelec/Solaris/d02/2bossy.html

Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.

Cronin, B. (2001). Bibliometrics and beyond: Some thoughts on Web-based citation analysis. Journal of Information Science, 27(1), 1-7.

Egghe, L. (2000). New informetric aspects of the Internet: some reflections - many problems. Journal of Information Science, 26(5), 329-335.

Larson, R. (1996). Bibliometrics of the world wide web: An exploratory analysis of the intellectual structure of cyberspace. Proceedings of ASIS96, 71-78. Available: http://sherlock.berkeley.edu/asis96/asis96.html

Li, X. (2003). A review of the development and application of the Web Impact Factor. Online Information Review, 27(6), 407-417.

Oppenheim, C. (2000). Do patent citations count? In: Cronin, B. & Atkins, H.B. (eds.). The web of knowledge: a festschrift in honor of Eugene Garfield. Medford, NJ: Information Today Inc. ASIS Monograph Series, 405-432.

Park, H.W. & Thelwall, M. (2003). Hyperlink analysis: Between networks and indicators. Journal of Computer-Mediated Communication, 8(4). Available: http://www.ascusc.org/jcmc/vol8/issue4/park.html

Rodriguez i Gairin, J.M. (1997). Valorando el impacto de la informacion en Internet: AltaVista, el "Citation Index" de la Red. Revista Espanola de Documentacion Cientifica, 20, 175-181.


Rousseau, R. (1997). Sitations: An exploratory study. Cybermetrics, 1(1). Available: http://www.cindoc.csic.es/cybermetrics/articles/v1i1p1.html

Tashakkori, A. & Teddlie, C. (1998). Mixed methodology: Combining qualitative and quantitative approaches. Thousand Oaks, CA: Sage Publications.

Thelwall, M., Vaughan, L. & Bjorneborn, L. (2005, to appear). Webometrics. In: Annual Review of Information Science and Technology 39.

van Raan, A.F.J. (2001). Bibliometrics and Internet: some observations and expectations. Scientometrics, 50(1), 59-63.

Weiss, R., Velez, B., Sheldon, M., Manprempre, C., Szilagyi, M., Duda, A. & Gifford, D.K. (1996). HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. Proceedings of the 7th ACM Conference on Hypertext. ACM Press: New York.

Wilkinson, D., Thelwall, M. & Li, X. (2003). Exploiting hyperlinks to study academic Web use. Social Science Computer Review, 21(3), 340-351.


WEB CRAWLERS AND SEARCH ENGINES

OBJECTIVES

• To explain the limitations of link data collection methods, both personal web crawlers and search engines.
• To describe how crawlers find web pages.
• To review the parameters that a crawler may use.
• To explain the page types that cause problems for crawlers.
• To review additional issues for using search engine data.

INTRODUCTION

Every practical investigation into links has to obtain link data. In some cases an investigator may browse a chosen set of pages or sites to identify links of a given type (e.g., Park, Barnett & Nam, 2002). In most cases, however, the links are collected by a web crawler and then delivered to the researcher in summary form. Web crawlers are programs built to automatically download pages from the web by following links. Their design has important implications for the interpretation of the results of link analysis studies. Fortunately, it is not necessary to delve into the computer science of web crawlers to understand its impact on link analysis. The first half of this chapter deals with issues relevant to interpreting crawler data.

The second half of this chapter deals with additional theoretical considerations that apply to those getting their data from commercial search engines or the Internet Archive. These relate to the way in which search engines are optimized to deliver useful information to their users. This optimization can cause variations in results and other problems for link research.

WEB CRAWLERS

A web crawler is a computer program that is capable of retrieving pages from the web, extracting the links from those pages and following the new links. Alternative equivalent names include crawler, wanderer, spider, robot, and bot. Some commercial software describing itself in other terms - such as downloader, indexer, or link checker - may also incorporate a crawler. Web crawlers are normally fed with a single URL or a list of URLs to start with. The URLs are then visited and after each page has been downloaded, its links are extracted and added to the list to be crawled, if they are not already in the list. Single site crawlers are programs that can be given the URL of the home page of a site and then will attempt to crawl the whole site. In addition to the special crawlers designed by researchers (e.g., Thelwall, 2001; Garrido & Halavais, 2003) including SocSciBot, examples include many web site management programs, such as Microsoft Site Analyst and WebKing.

Figure 2.1. A basic web crawler

Figure 2.1 illustrates some key tasks of a web crawler. The program starts by being fed crawl parameters and a starting URL or list of URLs. Crawl parameters are discussed in more detail below. The page fetcher uses the first of the URLs in the URL store to download a page from the web and then passes it on to the duplicate page checker. This checks to see if the page duplicates one already downloaded and, if so, rejects it. The exactness of the duplication test may depend on the specific crawl parameters loaded. These parameters are fixed for some crawlers and duplicate page checking is absent from others. If the page is not rejected, then it will be saved to the page store and also passed on to the link extractor. The link extractor extracts the links from the page and passes them on to the URL checker. This program will then test the URLs and reject them if they either have been already seen before, or fail the criteria specified in the crawl parameter list. In a small web crawler, one of the parameters typically specifies that the URL must come from the same site as the starting URL. Non-rejected URLs are then passed on to the URL list, which then passes one of the unvisited URLs back to the page fetcher. The cycle repeats until all of the URLs in the URL list have been visited.

A web crawler used in link analysis may create a file or database of link structure information in addition to its normal operations. Alternatively, this may be the task of a second program, operating on the web pages downloaded by the crawler.
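As a rough illustration of the cycle just described (URL store, page fetcher, duplicate page checker, link extractor and URL checker), the following minimal single-site crawler sketch uses only the Python standard library. It is a simplified toy under stated assumptions, not SocSciBot or any commercial crawler: duplicate checking is by exact content match, link extraction handles only simple href attributes, and politeness delays and robots.txt handling are omitted.

import re
import urllib.request
from urllib.parse import urljoin, urlparse

def crawl_site(start_url, max_pages=50):
    """Toy single-site crawler mirroring the components of Figure 2.1."""
    site = urlparse(start_url).netloc
    to_visit = [start_url]                  # the URL store
    seen_urls, seen_content = set(), set()
    page_store, link_store = {}, []         # downloaded pages and extracted links

    while to_visit and len(page_store) < max_pages:
        url = to_visit.pop(0)
        if url in seen_urls:
            continue
        seen_urls.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                        # page fetcher failed: skip this URL
        if html in seen_content:
            continue                        # duplicate page checker (exact content match)
        seen_content.add(html)
        page_store[url] = html
        # Link extractor: only simple <a href="..."> links are found; obscured links are missed.
        for href in re.findall(r'<a\s+[^>]*href="([^"#]+)"', html, re.IGNORECASE):
            target = urljoin(url, href)
            link_store.append((url, target))
            # URL checker: stay within the starting site and avoid already-seen URLs.
            if urlparse(target).netloc == site and target not in seen_urls:
                to_visit.append(target)
    return page_store, link_store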


Finding pages

The first important theoretical issue concerning web crawlers is that they can only visit pages that were in their starting URL(s) list or have been subsequently extracted from crawled pages. This can be seen from their architecture, as illustrated in Figure 2.1. There is one exception to this rule. Some crawlers guess at home page URLs by truncating any new URL found at slashes. For example, given the URL http://www.db.dk/lb/home_uk.htm, a crawler may guess at two home pages and attempt to download three pages in total: http://www.db.dk/lb/home_uk.htm, http://www.db.dk/lb/ and http://www.db.dk/. There is no guarantee that all pages in a site will be found, however. Pages that are not linked to, were not in the initial list, and could not be guessed will be invisible to the crawler. In addition, as the discussion below will show, it is likely that some pages that are linked to will not be found, or will be found but not crawled. In a small, well-organized site, however, all pages should be found by following links from the home page. There may be exceptions to the rule such as test pages or old pages that are not intended for public consumption but have been left on the web server, although not linked. The unavoidable omission of such pages may not be a practical problem for most studies.
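The URL-truncation guess described above can be sketched as follows; this is an illustrative assumption about how such crawlers might behave, not the exact rule used by any particular crawler.

from urllib.parse import urlsplit

def guess_parent_urls(url):
    """Guess extra 'home page' URLs by truncating the path at each slash,
    e.g. http://www.db.dk/lb/home_uk.htm -> http://www.db.dk/lb/ and http://www.db.dk/."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    guesses = []
    # Drop one trailing path segment at a time and keep the resulting directory URL.
    for i in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:i]) + ("/" if i else "")
        guesses.append(f"{parts.scheme}://{parts.netloc}{path}")
    return guesses

print(guess_parent_urls("http://www.db.dk/lb/home_uk.htm"))
# ['http://www.db.dk/lb/', 'http://www.db.dk/']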

In a large, multiple-author site such as a university web site, it would not be reasonable to expect a crawler to find all pages. Individual academics may post course pages, for example, telling students their URLs but not linking to them. A bigger problem for comparative studies of web sites is that there is no universal policy for linking to web content. For example, in some universities a list of links to all staff and student home pages is maintained, but not in others. The decision about whether to create such a list could have a big impact upon the number of pages that a crawler is able to find.

Content crawling vs. URL crawling

Two important issues for the designers of web crawlers are whether duplicate pages should be ignored and, if so, how duplicate pages should be defined and discovered. Some possible alternative definitions of duplicate pages A and B are given below.

• A is a duplicate of B if A and B have the same URL.
• A is a duplicate of B if the contents of A and B are the same, i.e. their HTML files are identical.
• A is a duplicate of B if the contents of A are very similar to the contents of B, using an agreed measure of similarity.

The first definition is problematic because it is very common for web pages to have duplicate names. For example, the page index.html in the cybermetrics web site can be retrieved either through the URL http://cybermetrics.wlv.ac.uk/ or through the URL http://cybermetrics.wlv.ac.uk/index.html. This use of alias names for home pages is common. Within a site there may also be significantly different URLs for the same page because the site has been reorganized. The above page can still be accessed from its old URL http://www.scit.wlv.ac.uk/~cm1993/cybermetrics/. Some pages, or collections of pages, are also copied wholesale to other sites in a process known as mirroring. Adapting Cothey (2005), the terminology URL crawler will be used for a crawler that does not check for duplicate content, only for duplicate URLs. A content crawler, in contrast, performs some checks in an attempt to avoid duplicate pages. Both types of crawler are common. Commercial search engine crawlers seem to be content crawlers (Broder, Kumar, Maghoul et al., 2000), as is SocSciBot, but personal web crawlers seem to be mainly URL crawlers.

For most link analysis purposes, content crawlers are preferable. The reason is that if links are being counted for any purpose then it does not make sense to count links in a page twice just because it has an alias URL. Commercial search engines also do not want to keep duplicate pages because they take up storage space and users would not often benefit from being shown alternative locations for the same content.

URL crawling can be an advantage in topological link analyses, when the pattern of interconnectivity of pages is studied, because removing duplicate pages can lose valuable structure information, but it can also add unwanted additional structure. These ideas are illustrated for the simple system of pages shown in Figure 2.2.

Figure 2.2. A small collection of web pages

Figure 2.2 shows a collection of three pages, A, B, and C, but the page B has two URLs, b and d. An URL crawler will ignore all page contents and just crawl all different URLs, irrespective of whether two URLs both point to the same page. Assuming that an URL crawler can find all of the URLs a to d in Figure 2.2 (e.g. by following links from other pages not shown), then the structure that it will find is shown in Figure 2.3. This has two problems. For link counting, the number of links is incorrect: three links are shown when there are actually only two: one in page A and one in page B. From a structure perspective, URL d is incorrectly found to be not linked to by URL a; URL d's page is linked to from URL a, so logically URL d should be too.

Figure 2.3. The results of an URL crawl


A content crawl could result in two possible different crawls, depending upon which of URLs b and d are crawled first. If URL b was found and crawled before URL d, then when URL d was crawled it would be rejected because its page is a duplicate of URL b's page. The result would be the diagram on the left of Figure 2.4, which would be correct. But if URL d was found first, then when URL b was crawled it would be rejected because its page is a duplicate of URL d's page. The result would be the diagram on the right of Figure 2.4, which would be incorrect from a topological point of view, because the link structure has been broken up. However, from a link counting point of view, the results are correct because in both cases two links are shown.

Figure 2.4. The two possible results of a content crawl

Although content crawls are an acceptable solution for link counting purposes, the ideal solution for a topological analysis would be to maintain a record of which pages are duplicates during a content crawl and then merge the duplicates before the topological analysis. Figure 2.5 illustrates this solution for the Figure 2.2 system. A content crawl has been conducted, but the information that URL b and URL d are equivalent has been recorded, whichever was crawled first. The link from page A can be seen to point to page B, irrespective of the order in which URLs b and d were crawled, because both URLs for page B are known.

Figure 2.5. The result of a content crawl recording duplicate URLs

Despite the clear advantage for structure preservation of using a content crawl in combination with tracking duplicate URLs, it seems that topological analyses have used content crawls alone, ignoring the problem of the inevitable structure changes caused (Broder, Kumar, Maghoul et al., 2000; Baeza-Yates & Castillo, 2001; Thelwall & Wilkinson, 2003). This may not be a significant problem for large-scale analyses, but there is no evidence yet to decide either way.
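The record-and-merge idea can be sketched as follows: during a content crawl every alias URL is mapped to the canonical URL of the first copy of the page seen, and the extracted links are rewritten through that mapping before the topological analysis. This is an illustrative sketch under that assumption, not code taken from SocSciBot or any other crawler.

def merge_duplicate_urls(pages, links):
    """pages: list of (url, html); links: list of (source_url, target_url).
    Returns the links rewritten so that alias URLs of the same page are merged."""
    canonical = {}                     # url -> canonical url for its content
    first_seen = {}                    # page content -> canonical url (first copy seen)
    for url, html in pages:
        canonical[url] = first_seen.setdefault(html, url)
    resolve = lambda u: canonical.get(u, u)   # URLs of uncrawled pages are left unchanged
    return [(resolve(src), resolve(tgt)) for src, tgt in links]

# As in Figure 2.2: page B has two URLs, b and d, so the link a -> d is merged into a -> b.
pages = [("a", "<page A>"), ("b", "<page B>"), ("d", "<page B>"), ("c", "<page C>")]
links = [("a", "b"), ("d", "c")]
print(merge_duplicate_urls(pages, links))   # [('a', 'b'), ('b', 'c')]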


Content crawling

Content crawling faces two problems, knowledge of which can help the interpretation of their results. The first problem is in identifying when two pages are duplicates. A page that contains a text hit counter, for example, will be different every time it is retrieved (because the number will change) and so if it has two equivalent URLs then it will not be identified by an exact content-match check. For this reason, some content crawlers reject two pages if they are similar but not identical. Fortunately, this kind of problem seems to be rare (Thelwall, 2000) and so not a practical cause for concern in most cases. A subtler point, however, is the need to keep certain kinds of page, even if they are duplicates. SocSciBot, for example, does not perform duplicate checks on "frameset" pages. These are typically very small and often created in a standard template form by web page editors. Excluding these could result in entire sites not being crawled because the starting frameset page was excluded and its 'frame' links not followed.

The second important content crawling problem is a purely technical and practical one: it takes time to do the comparisons. For example, if a new page is fetched at the end of a crawl of one billion pages then the duplicate checking needs to ensure that it is different from the billion previous pages. Appropriate computing techniques can enormously reduce the number and complexity of these checks (e.g., using a 'trie' data structure and checking numerical 'hashed' versions of the pages (Heydon & Najork, 1999)), but there is still a significant time penalty for the checking. More fundamentally, large commercial search engine crawlers are distributed over many computers so that there is not one single list of downloaded pages to check against (e.g., Brin & Page, 1998). It is not known how commercial search engines cope with this problem, but one of AltaVista's scientists has mentioned that AltaVista crawlers deliver pages that have been partially filtered for duplicates and that AltaVista uses a second program to eliminate the remainder of the duplicates (Broder, Kumar, Maghoul et al., 2000).
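A common way to cut the cost of these comparisons is to store a fixed-length hash of each page rather than the full text, so that a new page only needs to be compared against a set of compact fingerprints. The sketch below illustrates the general idea with Python's standard hashlib; it is not how AltaVista or any specific engine did it, and exact hashing still treats near-duplicate pages (such as ones containing hit counters) as different.

import hashlib

class DuplicateChecker:
    """Remember a compact fingerprint of every page seen, so duplicate detection
    does not require storing or re-comparing full page texts."""
    def __init__(self):
        self.fingerprints = set()

    def is_duplicate(self, html: str) -> bool:
        digest = hashlib.sha1(html.encode("utf-8")).hexdigest()
        if digest in self.fingerprints:
            return True
        self.fingerprints.add(digest)
        return False

checker = DuplicateChecker()
print(checker.is_duplicate("<html>same page</html>"))                # False: first time seen
print(checker.is_duplicate("<html>same page</html>"))                # True: exact duplicate
print(checker.is_duplicate("<html>same page, visit 42</html>"))      # False: a hit counter defeats exact matching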

The research crawler SocSciBot crawls one site at a time, and checks exhaustively for duplicates within each site. It does not, however, check for pages duplicated across different sites. This is an issue if a site crawled contains a large mirror site. SocSciBot's solution is the advance manual override 'banned list' feature that allows the operator to instruct it to avoid identified mirror sites. Alternatively the link processing software can post-process the link data to remove mirror sites.

Obscured links

Obscured links are links that are present in a web page but will not be found by a crawler. The link extractor part of a crawler is not capable of extracting all links from web pages because some can be stored in formats that are, in practice, impossible for them to decode. It follows that the format in which a site's links are created can have a big impact upon how many pages the crawler can find. The following example illustrates one kind of obscured link.

Links in web pages in the early days of the web could only be in one simple format. They had to start with <a href=" and end with ">. The quotation marks were optional and extra spaces (white space characters) could be inserted, but essentially a web crawler only needed to search for occurrences of <a href=" in the source of a web page to identify the start of its links. It could reliably extract all links from web pages. With the introduction of the web page programming language JavaScript, the URLs of links could be stored in web pages in a practically infinite variety of ways so that it was no longer possible to guarantee extracting all links from a page. For example, a page with JavaScript could have its main links in a menu, with the actual URLs embedded in the JavaScript program.
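The contrast between 'standard' links and obscured links can be seen in a small extraction sketch. The regular expression below follows the old <a href="..."> convention, allowing optional quotes and extra white space; a URL that is assembled inside JavaScript is simply invisible to it. This is an illustrative assumption about how a simple extractor behaves, not the parser used by any particular crawler.

import re

HREF_PATTERN = re.compile(r'<a\s+[^>]*?href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)

def extract_links(html):
    """Return the URLs of 'standard' anchor links; JavaScript-built links are missed."""
    return HREF_PATTERN.findall(html)

page = """
<a href="http://www.db.dk/lb/">Standard link (found)</a>
<a href = research.htm>Unquoted link (found)</a>
<script>
  // Obscured link: the URL only exists inside a JavaScript expression.
  window.location = "http://cybermetrics.wlv.ac.uk/" + "index.html";
</script>
"""
print(extract_links(page))
# ['http://www.db.dk/lb/', 'research.htm'] - the JavaScript-built URL is not extracted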

Obscured links are an important threat to the validity of link analysis data. If one or more web sites in a collection use JavaScript extensively enough to prevent it being effectively indexed then it may not be possible to conduct an effective analysis of the set. For large sites that only use JavaScript on some pages, reasonable site coverage may still be obtained as long as the start page has links that can be extracted.

JavaScript is not the only enemy of effective crawling; others are Java and Shockwave. There is a lesson for web site designers here too: if they want their pages to be visited by commercial search engine crawlers then they must ensure that there are enough 'standard' links for crawlers.

Depth and other arbitrary limitations

Some crawlers are programmed to incorporate arbitrary limitations to avoid spider traps and for other practical purposes. A common one is the depth limitation. The depth of a page in a site is the smallest number of steps required to get to the page by following links from the home page. Thus, the site home page has a depth of 0, all pages linked to by the home page have a depth of 1, and so on. Crawlers may keep a track of page depths and have an arbitrary limit, e.g. 10, as a precaution against spider traps. Common arbitrary limitations for crawling are listed below.

• Maximum depth in a site
• Maximum number of pages in a single site
• Maximum URL length
• Maximum page size
• Maximum number of slashes in an URL
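Limitations of this kind are usually implemented as simple checks applied before a URL is added to the crawl list. The sketch below shows plausible checks of this sort; the particular threshold values are invented for illustration and are not taken from any real crawler (the maximum page size check would be applied after the page is fetched, so it is omitted here).

from urllib.parse import urlparse

# Illustrative, invented limits; real crawlers choose their own values.
MAX_DEPTH = 10
MAX_PAGES_PER_SITE = 15000
MAX_URL_LENGTH = 256
MAX_SLASHES = 8

def url_allowed(url, depth, pages_crawled_on_site):
    """Apply arbitrary crawl limitations of the kind listed above."""
    if depth > MAX_DEPTH:
        return False
    if pages_crawled_on_site >= MAX_PAGES_PER_SITE:
        return False
    if len(url) > MAX_URL_LENGTH:
        return False
    if urlparse(url).path.count("/") > MAX_SLASHES:
        return False
    return True

# A page twelve link-steps from the home page is rejected by the depth limit.
print(url_allowed("http://www.example.ac.uk/a/b/page.htm", depth=12, pages_crawled_on_site=500))  # False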

Automatically generated pages

Automatically generated pages are web pages that are created in response to web surfers' actions and do not exist before they are requested. These can cause problems to web crawlers. An example that everyone is familiar with is the search engine results page. When a query is submitted to a search engine, it will search its database for relevant URLs and then build a results page. This page will combine the query results with other information, such as the search engine logo and an advert. The web page it creates is a genuine web page with its own unique URL (for most search engines) but if a crawler visited the search engine site, it would not find the page because it was created in response to the query and then (effectively) instantly destroyed. Generalizing from this example, there is a lot of information that is available on the web but is not present in static web pages. It can only be accessed by submitting a query to a database. Examples include some library catalogs, business product databases and digital libraries. These are part of what has been commonly called the 'Invisible Web' or 'Deep web', the former description having been popularized by Sherman and Price (2001). Those using data from a crawler have to accept its inability to find many types of automatically generated pages as an unavoidable limitation.

Interestingly, some information providers that operate web databases also generate large numbers of static pages containing key parts of their database, purely so that search engines can index them. At the time of writing, Amazon.com was apparently creating a static page for each book in its database.

For link analysis purposes, the absence of databases from crawler data is unlikely to be a problem, but there is a related issue that is a big concern. This is the use of automatically generated pages in the core of a site, rather than just for databases. There are several web technologies that make it easy to do this, including PHP (PHP: Hypertext Pre-processor - a recursive acronym) and Microsoft's Active Server Pages (ASP). As an example of this in practice, at the time of writing the main part of the University of Wolverhampton web site was stored in a database and all except one of the links on the home page were to ASP pages. The purpose of this arrangement was to allow all the content to be stored in a small database so that customized web pages could be built for visitors. For example, the email information page for staff had different information, but some overlap, with the email page for students.

The Wolverhampton site, and other sites using a similar format, are a problem for crawlers because crawlers are often programmed to avoid all automatically generated pages (e.g., Chakrabarti, Joshi, Punera, & Pennock, 2002). Such pages can normally be identified by the presence of a question mark in their URL. The reason that automatically generated pages are often ignored is because they cause spider traps in some places on the web. A spider trap is a theoretically infinite collection of pages that a spider can never completely crawl. An example of an inadvertent spider trap is an online calendar that is able to create a web page for any given date, and includes in that page a link to a page for the following day. A crawler finding just one page from this calendar can then continue forever requesting the calendar page of the next day.

Commercial search engine crawlers used to avoid all URLs containing a question mark because of the spider trap problem. At the time of writing, Google had relaxed this restriction in response to the proliferation of ASP and PHP sites. Presumably the Google spider has a different method of avoiding spider traps. For example, it may follow links to URLs containing a question mark but not follow links from pages with a question mark in their URLs. This would break the recursive cycle that is necessary for a spider trap.

Link analysis researchers should be aware that their web crawler may not be able to crawl sites with automatically generated pages, or may crawl them only to a shallow depth. This is also true for the crawlers of commercial search engines. When using a crawler, the parameters used should be declared so that other researchers can judge the impact that they may have on the results. An example of quite good practice is the following set of approximate crawl parameters, declared in an article.

• URLs with substrings in the following list are disallowed: ?, cgi-bin, &
• URLs with more than some maximum number of path components (counted by slashes) are disallowed
• URLs are permitted to have some maximum number of characters.

(Chakrabarti, Joshi, Punera, & Pennock, 2002)
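A URL filter implementing rules of this kind might look like the following Python sketch. The specific limits (20 path components, 200 characters) are illustrative assumptions and are not taken from the cited article.

DISALLOWED_SUBSTRINGS = ['?', 'cgi-bin', '&']
MAX_PATH_COMPONENTS = 20     # assumed limit, counted crudely by slashes
MAX_URL_LENGTH = 200         # assumed limit

def url_allowed(url: str) -> bool:
    if any(s in url for s in DISALLOWED_SUBSTRINGS):
        return False
    if url.count('/') > MAX_PATH_COMPONENTS:
        return False
    if len(url) > MAX_URL_LENGTH:
        return False
    return True

print(url_allowed("http://www.example.ac.uk/research/index.html"))    # True
print(url_allowed("http://www.example.ac.uk/cgi-bin/search?q=link"))  # False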


Ethical issues and robots.txt

There are mechanisms for web site owners to instruct web crawlers not to visit areas of their web site or to avoid the entire site. The most well-known of these methods is the robots.txt file (Koster, 1994). A web site owner can create a file called robots.txt and use it to list the areas of their site that a robot should not visit. For example, if there is a subdirectory of the site called /search that contains information that the site owner does not want to be crawled, then a two line robots.txt file would suffice for this, as shown below.

User-agent: *
Disallow: /search

The first line indicates that the instruction applies to all crawlers. The star * is a wild card. The second line indicates that all URLs in the site starting with /search (after the domain name) should not be visited.

The User-agent command allows the site owner to give different instructions to different search engines by specifying their name. For example, they might only wish commercial search engines to index their site, and use the robots.txt file to ban all other crawlers. The robots.txt file associated with a site must be accessible with a URL in the root directory of the site, and have the file name robots.txt. For example, the file for Microsoft's site is at http://www.microsoft.com/robots.txt and can be viewed in a web browser.
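Checking these instructions programmatically is straightforward: Python's standard library includes a robots.txt parser. The sketch below is a minimal example of such a check; the crawler name 'LinkAnalysisBot' and the site URLs are hypothetical.

from urllib.robotparser import RobotFileParser

def allowed_to_crawl(user_agent: str, url: str, robots_url: str) -> bool:
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()                       # fetches and parses the robots.txt file
    return parser.can_fetch(user_agent, url)

# With the two-line file shown above, /search would be refused for every agent:
# allowed_to_crawl("LinkAnalysisBot",
#                  "http://www.example.ac.uk/search/index.html",
#                  "http://www.example.ac.uk/robots.txt")   # -> False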

An ethical robot that visits a site should first check for a robots.txt file. If one exists, it should carry out the instructions. In some cases this will mean that it is not allowed to crawl the site at all. This can be a problem for research because it means that link data will not be available from a crawl of the site. At the time of writing, only one UK university, Liverpool University, had deployed a robots.txt file banning personal crawlers from its main site. The current version of this file can be found at http://www.liv.ac.uk/robots.txt.

The web page

The web page is not a clearly understood and defined entity. There are publicly available, seemingly authoritative definitions that incorporate fundamental disagreements (Thelwall, 2002). The following lists five major components of a definition, each with plausible alternative interpretations.

• File format
  o An electronic file validly encoded in the language of the Web, HyperText Markup Language (HTML), or
  o any file type accessible through a modern Web browser, including non-HTML formats such as plain text, PDF (Portable Document Format) and Microsoft Word.
• Access mechanism
  o Requests made using the official 'port number' of the Web, 80, or
  o requests made using the official computer request language of the Web, the HyperText Transfer Protocol (HTTP, as seen at the start of many URLs), or
  o requests made using any mechanism available to a modern Web browser, including common non-web protocols such as FTP (File Transfer Protocol).
• Scope
  o Public web pages that are available to all web users, or
  o public and private Web pages, including password protected pages and Intranet and Extranet pages.
• Permanence
  o Static resources only, or
  o all resources, including dynamically-created Web pages such as search engine results pages.
• Compound pages
  o A single file is a single web page, or
  o compound documents, such as those built up from separate files using the HTML frameset feature, also count as one single page.

The phrase "web page" is almost certainly used in practice with a variety of meanings,perhaps including most combinations of those mentioned above, and its actual meaning in anysituation will be dependant on the context and technical background of the communicators.Probably the need to be precise about exactly what constitutes a Web page is actually veryrare outside of counting exercises.

Web crawling summary

There is no practical and foolproof way to crawl all web pages in a large site. There are ways to improve coverage, such as querying search engines for additional pages, using link lists from previous crawls (if available) or guessing extra URLs (e.g. by testing each known directory home page), but these do not solve the fundamental problem. Researchers have to accept that link data from a crawler will give incomplete site coverage and find theoretical or empirical justification for the validity of their analysis. The issue should not be ignored, however. The phrase 'publicly indexable pages' can be applied to the collection of pages in a site that can be found by a crawler by following links from the home page and obeying ethical restrictions. Using this phrase in reporting crawler coverage gives a convenient reminder that no claim is made that all pages in the crawled sites have been downloaded.
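As a rough illustration of the 'guessing extra URLs' tactic mentioned above, the following Python sketch derives candidate directory home page URLs from URLs that have already been crawled; each candidate could then be requested to see whether it exists. The URLs are invented examples.

from urllib.parse import urlsplit

def candidate_directory_home_pages(known_urls):
    # Generate untried directory URLs from URLs already crawled.
    candidates = set()
    for url in known_urls:
        parts = urlsplit(url)
        path = parts.path.rsplit('/', 1)[0] + '/'    # strip the file name
        candidates.add(f"{parts.scheme}://{parts.netloc}{path}")
    return candidates - set(known_urls)

known = ["http://www.example.ac.uk/physics/staff/smith.html",
         "http://www.example.ac.uk/physics/index.html"]
print(sorted(candidate_directory_home_pages(known)))
# ['http://www.example.ac.uk/physics/', 'http://www.example.ac.uk/physics/staff/']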

SEARCH ENGINES

Many commercial search engines offer services that are useful for link analysis. The most common is the "links to this URL" facility, which is designed for web site owners to find out which pages link to their site. Some search engines may allow more sophisticated queries, such as to search for all pages that link to any page in a whole site, or to search for all links between two sites. Since search engines are free to use and require little computing skill to operate, they are a logical choice for some link analysis studies, particularly for small-scale investigations. It is therefore important to have some idea of how commercial search engines work and how this impacts upon the results that they report.


Note that this discussion does not refer to search directories, such as IDirectory and the Open Directory Project (DMoz.org).

It is useful to break up a commercial search engine into three different types of functional unit: web crawler, data storage, and processor. Real search engines are far more complex than this but this level of detail is useful here. The crawlers fetch pages from the web and add them to one of the stores. The processors process the data in one or more of the stores and return results to search engine users. There may be several different types of crawler, store and processor, including some or all of the following.

Crawlers: Standard crawlers for visiting the whole web periodically (e.g. every month); fast crawlers to revisit key news pages every hour, or even more frequently; variable speed crawlers that use an algorithm to determine how frequently to revisit pages; topic crawlers to identify pages covering a specific topic; geographic crawlers to ensure good coverage of a specific region.

Data stores: Up-to-date full crawl data; copies of the full crawl data for efficiency; parts of the full crawl data (e.g., just the links); part of the data processed to allow efficient access by the processors; old crawl data for backup purposes; partially complete crawl data; data from the special crawlers, including geographic information.

Processors: Standard processor for normal text queries; special fast processor for the first results page of normal text queries; special processor for advanced queries; special fast processor for the first results page of advanced queries; geographic processor; image/news/newsgroup/multimedia/product processor.

It follows from the variations described above that uniformity should not be expected in search engine results. The same query repeated twice may change because the database has changed (particularly if the fast crawler has just found new relevant results), because a different version of the data was used, or because the main data set has just undergone one of its periodic major updates (e.g. every month). Similarly, not all pages will be crawled with the same frequency. It would be reasonable to expect a popular or frequently updated page to be crawled more often than average. The results of two related queries may also be inconsistent if they use different processors. For example, a text query may find a page that links to site A but a query for links to site A might find no results. This could be the result of using different data sources or of processing the data in different ways.

Known biases

The above discussion covers important but quite abstract sources of inconsistency in search engines. Some research has also found widespread unintentional bias in their coverage of the web. The major search engines do not seem to have a specifically linguistic bias, but they do have national biases. For example, sites in the USA are covered particularly well and sites in China comparatively badly (Vaughan & Thelwall, 2004). This is indirectly caused by web site age. New sites are less likely to be found by web crawlers because they are less likely to be linked to. The sites of nations that are relatively new to the web are therefore less likely to be crawled.


Search engine ranking

Search engines do not return their results in a random order. They are normally ranked so that pages most likely to be useful are delivered first. This is important for any link research that needs to take a random sample of links, e.g. for validation of experimental hypotheses. Factors taken into account in the ranking vary by search engine and search type but can include some or all of the following: how frequently the keywords searched for (if any) occur in the document and whereabouts they occur, how many inlinks the page has, how frequently the page has been updated, and whether the page has been judged to be 'spam'.

For research that relies upon a random sample, ranking is not a problem if the search engine returns all matching pages because one can sample from the full list. The problem occurs when the results are ranked and the number of results exceeds the maximum number that the search engine will report. In this case true random sampling is not possible. A random sample of the pages that were returned might represent a sample of the most important pages for the query, however, and this may still be useful.

THE INTERNET ARCHIVE

The Internet Archive deserves a special mention. It operates like a commercial search engine except that it keeps all old copies of web pages downloaded. This allows researchers and others access to old information. This can be very useful, for example to give an earliest known existence date for a site. If the site is in the archive then it clearly existed at the date reported but, unfortunately, its absence from the archive at an earlier date does not mean that it did not exist, only that it was not found or indexed. See Thelwall & Vaughan (2004) for more about the archive crawler, including national biases similar to those for commercial search engines.

SUMMARY

Web crawlers operate by following links. Their main limitations are that they can only find pages that (a) they are allowed to visit, (b) are linked to or previously known about, (c) are linked to in a way that the crawler can extract from the linking page, and (d) match the crawl parameters. Typically excluded will be: isolated pages, sites that only use JavaScript, Flash or Java links, pages that are the result of database queries, and the contents of web-accessible databases. Different crawlers will also have different parameters and give different results (Arroyo, 2004). In summary, 'crawling a web site' is a relative rather than absolute concept. The number of pages found will depend upon the site, the crawler design and the parameters under which the crawler is operating. Content crawling, which attempts to find and eliminate duplicate pages as far as possible, is preferable to URL crawling, but is still not ideal for investigations into topological or graph properties of the web.

Search engines are complex machines designed to give good quality information quickly to surfers. They are not designed to give high quality, accurate information to researchers based upon uniform, unbiased coverage of the web. Despite this, they are still an attractive source of link information, as long as users are aware of their limitations. These limitations encompass, and extend, the limitations of web crawlers.


FURTHER READING

A good simple overview of a possible crawler design is given by Google's Brin & Page (1998) in their seminal paper. Chakrabarti's (2003) book also gives a readable description of web crawlers and applications. A survey of information about the functionality and operation of search engines can be found in Arasu, Cho, Garcia-Molina, Paepcke & Raghavan (2001), and an information science perspective on the implications of possible commercial search engine design parameters can be found in Mettrop and Nieuwenhuysen (2001). See also the chapter in this book discussing the use of commercial search engines and the Internet Archive for data collection.

The best place to look for current search engine issues is the web site SearchEngineWatch.com, which tracks individual variations between the major search engines and maintains up-to-date news bulletins.

REFERENCES

Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A. & Raghavan, S. (2001). Searching the web. ACM Transactions on Internet Technology, 1(1), 2-43.
Arroyo, N. (2004). Evaluation of commercial and academic software for Webometric purposes. WISER Technical report: CINDOC, Madrid.
Baeza-Yates, R. & Castillo, C. (2001). Relating web characteristics with link based web page ranking. In: Proceedings of SPIRE 2001, IEEE CS Press, Laguna San Rafael, Chile, pp. 21-32.
Bar-Ilan, J. (2001). Data collection methods on the web for informetric purposes: A review and analysis. Scientometrics, 50(1), 7-32.
Bar-Ilan, J. (2004). The use of web search engines in information science research. Annual Review of Information Science and Technology, 38, 231-288.
Brin, S., & Page, L. (1998). The anatomy of a large scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., & Wiener, J. (2000). Graph structure in the Web. Computer Networks, 33(1-6), 309-320.
Chakrabarti, S. (2003). Mining the Web: Analysis of hypertext and semi-structured data. New York: Morgan Kaufmann.
Chakrabarti, S., Joshi, M. M., Punera, K. & Pennock, D. M. (2002). The structure of broad topics on the Web. WWW2002. Available: http://www2002.org/CDROM/refereed/338/
Cothey, V. (2005, to appear). Web-crawling reliability. Journal of the American Society for Information Science and Technology.
Garrido, M., & Halavais, A. (2003). Mapping networks of support for the Zapatista Movement: Applying Social Network Analysis to study contemporary social movements. In: M. McCaughey & M. Ayers (Eds.), Cyberactivism: Online activism in theory and practice (pp. 165-184). New York: Routledge.
Heydon, A., & Najork, M. (1999). Mercator: A scalable, extensible Web crawler. World Wide Web, 2, 219-229.
Koster, M. (1994). A standard for robot exclusion. Accessed April 26, 2004, available: http://www.robotstxt.org/wc/norobots.html


Mettrop, W., & Nieuwenhuysen, P. (2001). Internet search engines: Fluctuations in documentaccessibility. Journal of Documentation, 57(5), 623-651.

Pant, G., Srinivasan, P., & Menczer, P. (2004). Crawling the web, In: M. Levene & A.Poulovassilis (eds.), Web Dynamics, Berlin: Springer.

Park, H.W., Barnett, G.A., & Nam, I. (2002). Hyperlink-affiliation network structure of topWeb sites: Examining affiliates with hyperlink in Korea. Journal of the AmericanSociety for Information Science and Technology, 53(1), 592-601.

Sherman, G. & Price, M. (2001). The invisible web. Chicago: Independent Publishers Group.Thelwall, M. & Wilkinson, D. (2003). Graph structure in three national academic Webs:

Power laws with anomalies. Journal of the American Society for Information Scienceand Technology, 54(8), 706-712.

Thelwall, M., & Vaughan, L. (2004). A fair history of the web? Examining country balance inthe Internet Archive. Library & Information Science Research, 26(2), 162-176.

Thelwall, M. (2000). Commercial web sites: Lost in cyberspace? Internet Research, 10(2),150-159.

Thelwall, M. (2001). A web crawler design for data mining. Journal of Information Science,27(5), 319-325.

Thelwall, M. (2002). Methodologies for crawler based web surveys, Internet Research:Electronic Networking and Applications, 12(2), 124-138.

Vaughan, L. & Thelwall, M. (2004). Search engine coverage bias: evidence and possiblecauses. Information Processing & Management, 40(4), 693-707.


THE THEORETICAL PERSPECTIVE FOR LINK COUNTING

OBJECTIVES

• To introduce theoretical and practical issues for link counting.
• To review common link data cleansing techniques.

INTRODUCTION

Counting links is at the heart of information science link analysis but unrestricted link counting is often not the most effective approach. In order to get results that are as useful as possible, it is necessary to be selective about which links to count and how to count them. This chapter introduces a theoretical perspective for link counting so that alternative methods can be assessed and compared. Links that do not fit the theoretical perspective are labeled as anomalies and two strategies are described for eliminating them or reducing their impact. The first strategy is manual filtering and the second is the use of different methods of conceptualizing the basic unit of web content, replacing the web page. This replacement allows different counting techniques to be implemented, reducing anomalies.

THE THEORETICAL PERSPECTIVE FOR LINK COUNTING

The philosophy underlying link counting is not normally made explicit in research papers, but a discussion of theoretical perspectives is important to guide the selection of link count methods in different contexts. Links to a page are valued because, at an abstract level, each one represents an endorsement of the target page by the author of the source page. For example, Brookes (2004), discussing Google, describes links as mass democracy, a plebiscite: Google sees pages as important if many pages (people) link to them.

"Intuitively, pages that are well cited [linked to] from many places around the web areworth looking at" (Brin & Page, 1998)



A metaphor of links as votes is used by Google to explain why more highly inlinked pages are likely to be more important (Google, 2004). Google nevertheless acknowledges the fact that not all links are equal.

"Also, pages that have perhaps only one citation [inlink] from something like theYahoo! homepage are also generally worth looking at" (Brin & Page, 1998)

Google actually embraces differences in importance between links, and attempts to assign higher weights to links from more important pages (as judged by inlink 'votes'). For the information science approach, however, assigning different weights to links for counting purposes is not desirable because of its complexity, although it has been tried (Thelwall, 2003b). An opposite approach is recommended: to take steps to ensure that the links counted are as equal in value as possible. Ideally, all links would be created with the same care and attention, and be worth the same. Stemming from this rationale, the theoretical perspective below is useful to help make judgments about which methods improve link counting from an information science perspective.

Theoretical perspective for link counting
In order to gain the best results from information science link research, all links counted should be created
• individually and independently,
• by humans,
• through equivalent judgments about the quality of the information in the target page.
Additionally, links to a site should target pages created by the site owner or somebody else closely associated with the site.

The perspective describes the ideal rather than reality. There are many cases when not all of the key aspects are true, e.g. created by humans, individually, independently, and through quality and relevance judgments. The purpose of this theoretical perspective is to allow the merits of different link counting methods to be assessed by comparing them to the ideal. The ultimate objective is to develop practical counting methods that give the most useful results.

ANOMALIES

The theoretical perspective can be used to label links as anomalies if they do not fit. The fact that most links may be classified as anomalies does not undermine the basis of this characterization.

Historically, the first identified source of anomalies appears to be internal site links (Ingwersen, 1998; Smith, 1999), even though these probably form the majority of web links. There are two main reasons for this. First, internal site links may have a primary purpose of navigation: allowing a user to move around the site. Intuitively, creating a link to a different site is a much stronger endorsement of the target than an internal site link.


The second reason is that internal site links may not be created by humans but may be automatically replicated around a site, perhaps in the form of a standard navigation bar. Although the existence of a page in a standard navigation bar will often mean that it is one of the most important in a site, it conveys no information at all about how useful the contents of any of the site's pages are to users. A huge site full of useless information but with a standard navigation bar on each page would generate a high inlink count to the main pages. Google's creators Brin and Page appear to have been the first to recognize this, citing the possibility to ignore internal site links in their patent (Page, 2001), filed January 9, 1998. Many link analysis approaches ignore all internal links (variously defined) as a simple, but apparently effective anomaly elimination technique (e.g., Kleinberg, 1999; Ingwersen, 1998).

Replicated links are not always site self-links; they can also be site outlinks. This is particularly true for replicated credit/acknowledgement links. For instance, some web page authoring software tools automatically add an acknowledgement link to every page that they create. Such links are anomalies because humans do not create them, and they are not created individually and independently. They are a significant problem because there may be so many of these links that they dwarf other types of link. Acknowledgement link targets appeared at the top of a list of the most commonly targeted UK university pages in 2001 (Thelwall, 2002a), with the biggest single cause being credit links on pages created by a Cambridge-based free web log file analyzer program that produces reports in the form of sets of web pages. Another cause of replicated acknowledgement links is the web sites of inter-university research groups with a standard 'acknowledgement bar' at the top of every page linking to the home page of each participating institution. Web authoring software such as Macromedia Dreamweaver makes it very easy to clone such link bars. Anomalous inter-site links cannot be avoided by the policy of excluding internal site links, but alternative techniques are discussed below.

Non-replicated automatically generated links can also produce anomalies. This can occur when two web databases on different sites are highly interlinked, or when an author on one site creates many 'shortcut' links to database content on another site. An example of interlinked databases is the biochemistry molecule structure databases at the universities of Warwick and Cambridge in the UK (Thelwall, 2002b). These databases are present on the web in the form of tens of thousands of pages, each giving information on one molecule. If pages in two of these sites cover the same molecule then they link to each other. The net result is tens of thousands of links between the sites. Note that the links are not replicated, they all target different pages. Nevertheless they should not be seen as being as important as individually created links, and are anomalies because they violate the 'individually and independently' part of the theoretical perspective.

Finally, sites hosted for other organizations can be a problem. For example, large mirror sites may attract many links, creating an anomaly since the content of the web site is not created by the institution mirroring it. Examples of this are the SunSite mirrors found in many universities throughout the world. Common sources of anomalies are summarized in Table 3.1.

Table 3.1. Common link count anomalies
Source of anomaly       Reason for anomaly
Site self-links         Target page quality judgments are different from those for intersite links
Replicated links        Computer-created and/or not created individually and independently
Interlinked databases   Computer-created and/or not created individually and independently
Mirror sites            Authors are not associated with the host site


MANUAL FILTERING AND BANNED LISTS

One way to remove anomalous links is to manually filter them out. Search engines appear to be pioneers of this approach, maintaining lists of sites that their robots should not visit because they have been associated with link spam (link replication designed to gain a higher position in the list of results returned by a search engine), text spam (also to gain an 'unfair' advantage), and perhaps also undesired content, such as spider traps, illegal, adult, or enormous sites. Those who use link data from search engines have to accept that it may have been affected by manual filtering. The filters are not made public, unless by accident, and so there is no way of telling how much manual filtering has affected any given set of results.

Manual filtering by researchers is also possible, and is an established technique. For example, the crawler SocSciBot has the ability to read in a text file of URLs and to automatically avoid any URL matching those in the list. The 'banned lists' produced for academic research crawling with a variant of SocSciBot typically contain spider traps, mirror sites and content not produced by the organization owning the web site, including mini web sites hosted for other organizations. The banned lists can be viewed in the online databases described in chapter 18.
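The following Python sketch illustrates banned-list filtering of this kind. SocSciBot's actual matching rules are not reproduced here; the sketch simply skips any URL containing one of the banned strings read from a plain text file, and 'banned.txt' and the example URLs are hypothetical.

def load_banned_list(path: str) -> list:
    # One banned URL or URL fragment per line, e.g. in a file called banned.txt.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def is_banned(url: str, banned: list) -> bool:
    return any(fragment in url for fragment in banned)

banned = ["www.example.ac.uk/mirrors/sunsite", "www.example.ac.uk/calendar"]
print(is_banned("http://www.example.ac.uk/mirrors/sunsite/gnu/", banned))  # True
print(is_banned("http://www.example.ac.uk/physics/", banned))              # False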

Manual filtering is useful for eliminating anomalies, and is essential to ensure comprehensive coverage of large sites because some spider traps can only be identified manually. Commercial search engines, in contrast, probably have heuristics to avoid spider traps in large sites. Nevertheless, manual filtering is a potential methodological problem for academic research. Any situation in which researchers modify data is undesirable, but, in practice, data cleansing is very common. For example, there is a wide range of different statistical techniques for identifying and eliminating anomalies in data sets.

In order to avoid unintentionally biasing a link data set, it is important to have clear criteria for identifying links as anomalies. For example, rules could be designed to exclude both spider traps and mirror sites. In research, the rules should be reported so that other researchers can assess their impact on the findings. A further issue is raised here: the need for human choice in the selection of anomalies, even ones that are clearly defined. As with any classification exercise, it is inevitable that different people would not make the same choices. Ideally, the decision would be made through a consensus-based technique involving a group of independent people. This would give a high degree of academic credibility to the results. Unfortunately, this seems to be impractical for all except the smallest sites, and no previous research has used this tactic. For instance, the Wolverhampton Cybermetrics group uses a compromise approach, stating why URLs are banned but allowing the judgment to be made by a single person.



This method is made accountable by publishing the complete list of banned URLs on the group's web site so that other researchers can challenge it.

An approach that has not been adopted so far, but is, in principle, a good idea, is content analysis (Neuendorf, 2002). This is a method designed to help groups of classifiers to make reliable and repeatable judgments in situations of uncertainty. One aspect of content analysis is the creation of prescriptive lists of acceptable reasons for making a judgment. It has been used to help classify link types (see chapter 4) but not, as yet, for filtering out anomalies.

ALTERNATIVE DOCUMENT MODELS

Although manual filtering is useful for identifying mirror sites and spider traps, it is not a practical tool for finding other kinds of anomalies, particularly replicated links, except for a small collection of sites. A different approach for dealing with replicated link anomalies is to count links in new ways that reduce or eliminate their effect. An example of this strategy has already been mentioned: eliminating all internal site links. The Alternative Document Models (ADMs) are an extension of this idea to deal with intersite links. They have a separate, but closely related rationale, deriving from a theoretical examination of the concept of a document and how this has been traditionally operationalized in counting exercises. The following subsection discusses the web site as the best-known multiple-page unit of content on the web, leading to the ADM concept.

Web sites and web documents

The web site has been loosely defined in chapter 1 as a self-contained collection of one or more pages with a consistent theme. Although the usefulness of the phrase "web site" stems from the vagueness of the definition, which allows it to be used to describe vastly different collections of pages, it is helpful to identify some simple features that many will possess. Web sites are probably most frequently associated with domain names. For example, the Google home page is at http://www.google.com/ but its web site would probably be recognized as all URLs with domain names ending in google.com. Web sites can also be subsites of larger sites, often consisting of URLs with identical domain names and the same first few directories. For instance, all URLs beginning with http://www.scit.wlv.ac.uk/~cm1993/ would probably be accepted as being the author's web site, /~cm1993/ being the web folder allocated to the author by his department. In practice, web sites can be identifiable even if based on multiple domains as long as there are sufficient cues in the interface. Interestingly, Microsoft's FrontPage web editor describes collections of related pages as just a "Web" and recognizes standard types of "Web" such as the "Personal Web" and the "Customer Support Web". This may be an attempt to avoid confusing users who may have a different understanding of the more standard term "web site". In this book we adopt the term 'document' with a similar motivation.

The document, whilst recognizable to information scientists and the general public, is an awkward concept for which to give a general web definition. Recognizable genres of print and electronic documents are commonly found on the web, but so also are collections of pages that do not have an offline equivalent.


• Print media genre publications: e-journals, e-books, online conference proceedings, digital libraries.
• Information retrieval: online library catalogs, general search engines.
• Electronic media: photographs, films, music, even three-dimensional environments such as archaeological reconstructions.
• Informal communication: email archives, bulletin boards, real-time discussion forums.

The key issue for link counting is the relationship between documents and pages. Print documents are normally recognizable through being collections of pages physically bound together. This binding can also happen on the web through links between pages. For example, a Microsoft PowerPoint slide show can be automatically converted into a set of interlinked web pages. Books can be found on the web too, some as a single huge page, others in thousands of interlinked small pages. Links are not a sufficient determinant of document membership, however, because pages can interlink that are clearly not part of any kind of document.

New web genres can be harder to classify into documents. Presumably a frequently asked questions list is a coherent document, irrespective of whether it is a single file or broken up, or with single or multiple authors. But if, say, a small research group web site contains multiple types of resource, perhaps a home page, a background information page, twenty PDF project findings and papers as well as a links list, is this a single document or multiple documents, and if the latter, how many?

For information science link counting, an objective is to be able to model the average behavior of web authors following the above theoretical perspective for link counting, and so one logical solution is to allow one link per recognizable coherent body of work. Taking this into account in addition to the concerns above, the following is proposed as a deliberately loose working definition.

A web document is a collection of pages with a consistent theme produced by a single author or collaborating team. It may consist of any number of electronic files retrievable over the web using a modern browser.

The difference between this and the definition of a web site is that the 'self-contained' requirement is dropped: individual pages will be allowed to be documents, even if they are clearly part of a larger collection. The definition emphasizes the possible inclusion of non-HTML web pages, such as PDF files and Microsoft Word documents.

Due to its imprecise nature, the web document definition could be interpreted in different ways. The need for a systematic implementation of the definition could therefore lead to different solutions. For example, documents could be equated with small web sites or small subsites of a larger site, or with individual files. The problem remains of how to make the definition more prescriptive in order to implement it for counting purposes. Two possible solutions are to allow panels of human experts to group pages into documents, or to develop an automated heuristic for page aggregation, accepting that it represents a simplistic model. The first is likely to be impractical for anything other than small-scale link counting investigations, so the second option is pursued.

Two different types of page aggregation heuristics are possible. The first is to develop algorithms to automatically merge 'similar' web pages into documents based upon both their content and enveloping link structure.


The second possibility, formalized below as a series of ADMs, is to develop simple URL-based heuristics to automatically merge web pages into conceptual documents. Many web editors, including Microsoft FrontPage, have the default setting of storing all related documents in the same directory, perhaps using subfolders for auxiliary files such as images. This makes the directory a plausible level of aggregation. Another natural level of aggregation is the domain name: all URLs sharing a domain name. A total of four different document identification heuristics are defined below.

ADMs and standard ADM counting

The ADMs are heuristics for grouping pages together into conceptual documents, using URLs to assign pages to documents. The purpose is to reduce the extent to which anomalies occur in web linking behavior at the page level by assigning similar pages to the same document, so that related links created in similar pages are only counted once. There are four main ADMs, which aggregate web pages at the page, directory, domain and site level, as described below; a short code sketch of the corresponding URL reductions follows the list.

• Page/File: Each separate file is treated as a document for the purposes of extracting links.
  o URLs are truncated before any internal target marker '#' character found, to avoid multiple references to different parts of the same page, and then each unique link URL is treated as a separate document.
• Directory: All files in the same directory are treated as a single document.
  o URLs are truncated to the position of their last slash.
• Domain: All files with the same domain name are treated as a single document.
  o URLs are reduced to their domain name.
• University/Site: All files belonging to a university or other defined web site are treated as a single document.
  o URLs are reduced to the part common to all web pages in the site.
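The following Python sketch shows one way the four URL reductions might be implemented. The site prefix used for the University/Site model must be supplied by the researcher, and the example URL and prefix are hypothetical.

from urllib.parse import urlsplit

def page_doc(url: str) -> str:
    return url.split('#', 1)[0]            # drop any internal target marker

def directory_doc(url: str) -> str:
    return url.rsplit('/', 1)[0] + '/'     # truncate at the last slash

def domain_doc(url: str) -> str:
    return urlsplit(url).netloc.lower()    # keep only the domain name

def site_doc(url: str, site_prefix: str) -> str:
    return site_prefix                     # the part common to the whole site

url = "http://phys.cam.ac.uk/staff/smith/links.html#top"
print(page_doc(url))               # http://phys.cam.ac.uk/staff/smith/links.html
print(directory_doc(url))          # http://phys.cam.ac.uk/staff/smith/
print(domain_doc(url))             # phys.cam.ac.uk
print(site_doc(url, "cam.ac.uk"))  # cam.ac.uk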

The results of the file, university and domain counting models are given in Table 3.2, based upon the Bjorneborn diagram in Figure 3.1. In the picture, all pages link to all other pages (not drawn, to reduce complexity), including themselves. Site self-links are always ignored, whichever metric is being calculated. The directory model follows the same principles but is not shown for reasons of space.

To illustrate the counting process, consider the links from pages A and B in phys.cam.ac.uk (Cambridge University: physics) to pages X and Y in phys.ox.ac.uk (Oxford University: physics). With the standard page counting model, there are four such links: links from page A to pages X and Y, and links from page B to pages X and Y. Hence the total number of page ADM links from phys.cam.ac.uk to phys.ox.ac.uk is four. But the domain ADM would count only one link from phys.cam.ac.uk to phys.ox.ac.uk since all four of these links are from the domain phys.cam.ac.uk to the domain phys.ox.ac.uk. These four links, having the same source and target document, are now mutual duplicates.
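The same counting logic can be expressed in a few lines of Python. The sketch below reproduces the A, B to X, Y example just described (the page URLs are invented): the four page-level links collapse to one domain-level link because duplicate (source document, target document) pairs are counted only once.

from urllib.parse import urlsplit

links = [  # (source URL, target URL) pairs, as a crawler might record them
    ("http://phys.cam.ac.uk/A.html", "http://phys.ox.ac.uk/X.html"),
    ("http://phys.cam.ac.uk/A.html", "http://phys.ox.ac.uk/Y.html"),
    ("http://phys.cam.ac.uk/B.html", "http://phys.ox.ac.uk/X.html"),
    ("http://phys.cam.ac.uk/B.html", "http://phys.ox.ac.uk/Y.html"),
]

def adm_count(links, to_doc):
    # Map each URL to its document, then count distinct source-target pairs.
    return len({(to_doc(s), to_doc(t)) for s, t in links})

page = lambda url: url.split('#', 1)[0]
domain = lambda url: urlsplit(url).netloc.lower()

print(adm_count(links, page))    # 4  (page ADM)
print(adm_count(links, domain))  # 1  (domain ADM)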


[Figure 3.1. Three universities in which all pages connect to all other pages (links not shown). Domain model links are fine lines and site model links thick lines. All links are bi-directional.]

Table 3.2. Link counts for Figure 3.1.
Model        Links from cam to ox   Links from ed to ox   Total inlinks to ox
Page/file    9                      3                     12
Domain       4                      2                     6
University   1                      1                     2

A key issue for the use of ADMs is balancing anomaly elimination with loss of data. A higher level of aggregation will eliminate link replication anomalies, but will lose data. For example, in Figure 3.1, the university ADM only allows one link between pairs of universities, even though there are 9 page links from Cambridge to Oxford, and 3 page links from Edinburgh to Oxford. The count of 1 link would be the correct solution if, in both cases, all the links from the same university are anomalies, i.e. violate the theoretical perspective for link counting. But if the links are not anomalies, then the additional links would represent lost information: losing the fact that Cambridge links to Oxford three times more than Edinburgh does. For university-level data, testing with the UK indicates that the directory and domain ADMs are both appropriate, but the page ADM leaves in too many anomalies and the university ADM loses too much data (Thelwall, 2002b). See chapter 16 for more information on choosing the best ADM for any given data set.

ADM range counting models

An alternative to the four standard ADM methods of link counting is the set of range methods (Thelwall & Wilkinson, 2003). The range metrics use the same set of document models but count only the number of different target URLs, ignoring multiple links from one university/site to the same document at another. In effect, the range metrics apply the university/site document model to link sources and another selected ADM to link targets; they are equivalent to the four standard definitions except that the university/site document model is applied to link sources in all cases. The rationale for range counting is that within a single large site, such as a university web site, the creation of links is not independent, even across very different parts of the site. For example, many personal pages in a single university may link to the same target page because the authors have all attended a university-wide web page authoring course that publicized it.

Table 3.3 illustrates the results of three range counting models applied to Figure 3.1. Note that the university range model is identical to the standard university model.

Table 3.3. Range link counts for Figure 3.1.
Model              Links from cam to ox   Links from ed to ox   Total inlinks to ox
File range         3                      3                     6
Domain range       2                      2                     4
University range   1                      1                     2
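A range count can be sketched in code in the same way as the standard ADM counts, using the same hypothetical A, B to X, Y links as before. The crude 'last three domain parts' site heuristic below is an assumption that only suits ac.uk-style names; real studies would define sites explicitly.

from urllib.parse import urlsplit

links = [
    ("http://phys.cam.ac.uk/A.html", "http://phys.ox.ac.uk/X.html"),
    ("http://phys.cam.ac.uk/A.html", "http://phys.ox.ac.uk/Y.html"),
    ("http://phys.cam.ac.uk/B.html", "http://phys.ox.ac.uk/X.html"),
    ("http://phys.cam.ac.uk/B.html", "http://phys.ox.ac.uk/Y.html"),
]

def source_site(url):
    # Crude site heuristic: keep the last three parts of the domain name.
    return ".".join(urlsplit(url).netloc.lower().split(".")[-3:])

def file_range_count(links):
    # Count distinct (source site, target page) pairs.
    return len({(source_site(s), t.split('#', 1)[0]) for s, t in links})

print(file_range_count(links))   # 2: pages X and Y, however many Cambridge pages link to them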



CHOOSING LINK COUNTING STRATEGIES

The objective of choosing a link counting strategy is to make the results as meaningful and valid as possible. This does not mean that all of the techniques described in this chapter must always be applied: often this will be impossible. The choice of link counting strategy depends upon the origin of the data and the characteristics of the web sites being studied. If a commercial search engine or the Internet Archive is used, then the strategy is currently limited to manual filtering of the results. This may be impossible if there is a large number of links to check. The filtering could take the form of either removing individual anomalous links, or, to cope with many replicated links, revising query formulations to exclude anomalies. For example, if a mirror site were discovered, then it may be possible to modify the original search engine query to exclude all links to or from the mirror site.

If a specialist crawler like SocSciBot is used for data gathering, then its support for the manual exclusion of spider traps and mirror sites can be used. For data sets that are not too large, manual filtering can be extended to the link structure files created by the crawler. The next stage is to select the most appropriate ADM. Tools and techniques for ADM selection are discussed more in chapter 17. There are two quantitative techniques that can be used. If there is an external data set that is expected to associate with link counts (e.g. research productivity for university web sites) then link counts can be obtained for each ADM and the one with the highest correlation can be selected. Assuming that there is a genuine relationship between the link counts and the external data, the correlation test works as a method for selecting which ADM gives the most information from the link counts. If no external data source is available then an alternative quantitative approach is TLD spectral analysis (Thelwall, 2005). See chapter 17 for information about TLD spectral analysis.
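The correlation-based selection might be sketched as follows in Python. The inlink counts and research productivity figures below are invented purely for illustration; in a real study they would come from the crawl data and an external source respectively. Spearman correlation (here via the scipy library) is a reasonable choice because link count data are typically highly skewed.

from scipy.stats import spearmanr

research_productivity = [120, 85, 60, 30, 10]      # hypothetical external data

inlink_counts = {                                   # hypothetical counts per ADM
    "page":      [5400, 6100, 1200, 900, 150],
    "directory": [2100, 1700,  800, 400,  90],
    "domain":    [ 600,  450,  300, 150,  40],
    "site":      [   4,    4,    3,   2,   1],
}

for adm, counts in inlink_counts.items():
    rho, p = spearmanr(counts, research_productivity)
    print(f"{adm:9s} rho = {rho:+.2f} (p = {p:.3f})")

# The ADM whose counts correlate most strongly with the external data would be
# preferred, assuming a genuine underlying relationship.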



Choosing link counting strategies

Using a crawler (e.g. SocSciBot)
1. Manually filter out mirror sites and spider traps during the crawl.
2. If the data set is not too large, manually identify and filter out mirror sites, replicated links and other anomalies from the crawled link data.
3. Select an ADM using one or more of the following:
   a. Correlation tests with non-web data
   b. The results of previous research with ADM data and similar sites
   c. TLD spectral analysis
   d. A rational argument as to which is likely to be the best, based upon the type of sites

Using a search engine or the Internet Archive
1. If the data set is not too large, manually identify and filter out mirror sites, replicated links and other anomalies from the search results.

SUMMARY

Link count research often relies upon an unstated assumption that all links are equal in value when in practice they are not. The adoption of a theoretical perspective for link counting allows links violating it to be regarded as anomalies. One consequence is the need to exclude internal site links. In many situations, particularly for research purposes, this will not be enough and significant anomalies will remain. Two methods have been suggested for coping with this. The first is manual filtering, which is practical for small data sets. The second is the use of ADMs, which can be fully automated. These can eliminate anomalies in some cases and reduce their impact on the data in others. In practice, a degree of manual filtering is also often used to protect web crawlers from spider traps in larger sites that are being comprehensively crawled, and so a combination of the two approaches is also common.

FURTHER READING

Manual filtering and banned lists are described in Thelwall (2003a). A very brief description of commercial search engine filtering can be found in the paper of Broder, Kumar, Maghoul et al. (2000).

The theory behind Alternative Document Models was developed, but not named, in Thelwall (2002b) and Thelwall and Wilkinson (2003), which have both been used and adapted in this chapter. The name subsequently given was Advanced Document Models, in Thelwall and Harries (2003), but this was later changed to Alternative Document Models. This is very similar to an idea developed earlier by Bjorneborn (2001). Some Google research has also used a link counting model equivalent to one of the ADMs (Bharat, Chang, Henzinger, & Ruhl, 2001).

Figure 3.1 is a Bjorneborn diagram. See Bjorneborn (2004) for a complete system of graphical representation of web linking.

REFERENCES

Bharat, K., Chang, B., Henzinger, M. & Ruhl, M. (2001). Who links to whom: Mining linkage between web sites. In: Proceedings of ICDM 2001, pp. 51-58.
Bjorneborn, L. (2001). Shared outlinks in webometric co-linkage analysis: A pilot study of bibliographic couplings on researchers' bookmark lists on the Web. Royal School of Library and Information Science.
Bjorneborn, L. (2004). Small-world link structures across an academic web space: A library and information science approach. PhD Thesis. Royal School of Library and Information Science, Copenhagen, Denmark.
Brin, S., & Page, L. (1998). The anatomy of a large scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. & Wiener, J. (2000). Graph structure in the Web. Computer Networks, 33(1-6), 309-320.
Brookes, T. (2004). The nature of meaning in the age of Google. Information Research, 9(3), paper 180. http://informationr.net/ir/9-3/paper180.html
Google (2004). Our search: Google technology. http://www.google.com/technology/
Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.
Neuendorf, K. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.
Page, L. (2001). Method for node ranking in a linked database. United States Patent 6,285,999.
Smith, A.G. (1999). A tale of two web spaces: Comparing sites using web impact factors. Journal of Documentation, 55(5), 577-592.
Thelwall, M. & Harries, G. (2003). The connection between the research of a university and counts of links to its web pages: An investigation based upon a classification of the relationships of pages to the research of the host university. Journal of the American Society for Information Science and Technology, 54(7), 594-602.
Thelwall, M. & Wilkinson, D. (2003). Three target document range metrics for university web sites. Journal of the American Society for Information Science and Technology, 54(6), 489-496.
Thelwall, M. (2002a). The top 100 linked pages on UK university Web sites: High inlink counts are not usually directly associated with quality scholarly content. Journal of Information Science, 28(6), 485-493.


Thelwall, M. (2002b). Conceptualizing documentation on the web: An evaluation of different heuristic-based models for counting links between university Web sites. Journal of the American Society for Information Science and Technology, 53(12), 995-1005.
Thelwall, M. (2003a). A free database of university Web links: Data collection issues. Cybermetrics, 6(1). Available: http://www.cindoc.csic.es/cybermetrics/articles/v6i1p2.html
Thelwall, M. (2003b). Can Google's PageRank be used to find the most important academic Web pages? Journal of Documentation, 59(2), 205-217.
Thelwall, M. (2005). Data cleansing and validation for Multiple Site Link Structure Analysis. In: Scime, A. (Ed.), Web Mining: Applications and Techniques. Idea Group Inc., pp. 208-227.


INTERPRETING LINK COUNTS: RANDOM SAMPLES AND CORRELATIONS

OBJECTIVES

• To demonstrate the need to classify random samples of links to validate interpretations of link counts.
• To review random sampling techniques.
• To introduce correlation testing.

INTRODUCTION

Link counts are interpreted and described in different ways by different authors, which is illustrative of the confusion in the meaning that should be attributed to them. Counts of links to a web page, web site, or other aggregation of web pages were originally described as measures of impact (Ingwersen, 1998) in information science. The concept of impact derived from an analogy with citation analysis, which uses journal article citation counts as a scientific impact measure. In bibliometrics, a considerable body of research has been directed at validating citation counts, and some lessons learned can be transferred to link analysis. In this chapter, the issue of identifying and assessing interpretations of link counts is addressed. This is coupled with the need for initial investigations into whether there are enough links to make any particular investigation worthwhile. Both random sampling and correlation testing are discussed, first as part of a preliminary investigation into a proposed research question, and then as part of the validation of interpretations of link counts.

INTERPRETING LINK COUNTS

There is no uniformity over the meaning of links. In addition to impact, alternative terms to describe inlink counts or closely related calculations include visibility (Vreeland, 2000), trust (Davenport & Cronin, 2000; Palmer, Bailey, & Faraj, 2000), worth (to be looked at) (Brin & Page, 1998), quality (Hernandez-Borges, Macías-Cervi, Gaspar-Guardado et al., 1999), or topic authoritativeness (Kleinberg, 1999). Links between web sites or pages have been used as indicators of topic similarity (Kleinberg, 1999), aspatial proximity (cf. Park & Thelwall, 2003), international information flows (cf. Park & Thelwall, 2003), relationships in a network of organizations (Garrido & Halavais, 2003) and business connections (Park, Barnett & Nam, 2002). Of course, the web is large enough that valid interpretations of link counts could vary enormously by context. For example, business relationship strength might be a reasonable interpretation for links between a particular set of commercial web sites, and information flow might be appropriate for a different set of sites. The implication of all of these different but reasonable interpretations is that researchers should not assume without checking that links represent anything: they should be investigated to find out how they should be interpreted. This was first recognized by Smith (1999), who realized that if links could be counted reliably, then the main problem was of deciding what link counts signify.

The importance of identifying the significance of link counts is underlined by the existence of links that have no real use value. They may be created as a technological exercise rather than for any communication function (Thelwall, 2003). An example is the acknowledgement link often found in an academic's home page, pointing to the university from which they obtained their degree. Surely such links neither endorse the target page or site contents nor represent online communication. It seems likely that many links are created because the author was either learning web page authoring and wanted to practice link creation, or because linking to organizations' web sites if you mention them in your web page is 'the right thing to do', i.e. genre following. The existence of a proportion of such links does not mean that link counts cannot have a meaningful interpretation, as long as the proportion of irrelevant links is not too high.

The solution to the problem of deciding what link counts signify in a particular context can be coupled with the problem of checking that there are enough links to make an investigation worthwhile, and solved with a pilot study. Procedures for this are discussed in the next section.

Closely related to the need to identify an appropriate interpretation for link counts in a given study is the need to assess that interpretation after it has been established. In social science terminology, the assessment of link count interpretations is a validity issue. If a description such as 'impact' or 'visibility' is used then evidence must be presented to show that this genuinely reflects the normally accepted meaning of the term. For example, if link counts are described as an impact measure, the validity issue is to establish that 'impact' is a reasonable word to use to describe that which link counts measure. Alternatively, if a study of international links draws conclusions about international collaboration then this needs to be supported by evidence that the international links are related to collaboration. Validity assessment is a necessary step to support any conclusions drawn from data. The proposed primary solution to the validity/significance issue is to take a random sample of links and classify them in order to gain an overview of the different types of links present and their approximate proportions in the data set.

A secondary type of validity assessment, which is also an investigation of reliability, is the correlation test. The reliability of a measure normally refers to the extent to which the same results can be gained from repeated trials. In the context of web links, it is more appropriate to think of it as the extent to which link counts are determined by undesired factors outside of the researcher's control. For example, if link analysis data were obtained from a search engine that did not crawl a key site, then the absence of links from that site would be seen as a reliability issue.

THE PILOT FEASIBILITY AND VALIDITY STUDY

Every link analysis research project should include a pilot feasibility and pilot validity study in its early stages. The purpose of the studies is to assess whether there will be enough links to produce interesting results, and, if so, whether the types of links found are broadly consistent with the research goals. The types of links found must be assessed to see whether they match an interpretation of link counts that is appropriate for the desired outcomes of the research. A pilot study can save a lot of wasted effort if a project is unsuitable because of the quantity or quality of the links found. For example, a pilot study of the interconnectivity of flower shop web sites would probably reveal that very few flower shops link to each other and hence the issue was not worth addressing. Alternatively, a study of collaboration between major IT companies may find that the company web sites did interlink, but that the links were rarely caused by collaboration initiatives, so were not valid as a data source for collaboration.

The objective of pilot studies is to get quick approximate information about the two issues discussed above: quantity and validity. To address the quantity concern, a small sample of web sites should be taken and then appropriate statistics collected (e.g., inlink counts, or counts of links between pairs of web sites). If it is found that there are too few links then two potential responses are to change the scale of the study, for example to focus only on larger web sites, or to forsake the study if it is clearly not going to be feasible. The number of link counts that should be sampled cannot be prescriptively determined but, as a guideline, 40 should be feasible and should give a good idea of the likely overall results. Researchers on large projects may want to quadruple this figure for additional security. In any case, the sample should be chosen to be representative of the population in order to get as accurate an idea as possible of the full data set. The figure 40 is sufficient to ensure that if at least 50% of the pages contain relevant information then there is an 80% chance that in the full sample at least 30% will contain relevant information. In contrast, 160 would be sufficient to ensure that if at least 50% of the pages contain relevant information then there is an 80% chance that in the full sample at least 40% will contain relevant information. This information comes from a standard power test for a proportion (Normal distribution), available from the statistical software Minitab. See Howell (2002, chapter 8) for more on power in statistical tests.
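The power figures quoted above can be checked with any statistical package. The sketch below uses a one-sided normal-approximation power calculation for a single proportion; it illustrates the kind of test described rather than reproducing Minitab's exact routine, so its output is only approximate.

```python
from scipy.stats import norm

def proportion_test_power(n, p0, p_true, alpha=0.05):
    """Approximate power of a one-sided test of H0: p = p0 when the true
    proportion is p_true, for a sample of size n (normal approximation)."""
    # Sample proportion above which H0 would be rejected at the given alpha.
    critical = p0 + norm.ppf(1 - alpha) * (p0 * (1 - p0) / n) ** 0.5
    # Probability of exceeding that critical value when the true proportion is p_true.
    z = (critical - p_true) / (p_true * (1 - p_true) / n) ** 0.5
    return 1 - norm.cdf(z)

print(proportion_test_power(40, 0.3, 0.5))   # roughly 0.85
print(proportion_test_power(160, 0.4, 0.5))  # roughly 0.82
```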

The validity pilot should visit a sample of links to assess their type and determine how well they fit the goals of the research. A suitable number would be 40 again, but more if this number of links did not give a clear impression of link types. The links visited should be chosen at random as far as possible, and classified from a visit to the source page and, if necessary, also to the target page. If choosing links to classify from search engine results pages, it should be remembered that the results may be ranked so that the top few pages would not be representative of the overall results. The links should therefore be sampled at random from all of the available results pages.

In order to get a good idea of the range of types of links present, they should be classified using an appropriate schema. This schema may be based upon one found in a published similar article, or may be entirely new. The categories chosen should typically depend both upon the purpose of the link analysis and upon the number of links found in each preliminary category. New categories may be added to split up existing ones, hence reclassifying earlier links. To give an example, suppose that a research goal was to identify collaboration between companies and banks based upon hyperlinks from company to bank web sites. A very simple initial classification scheme could be a binary split into links reflecting any kind of collaboration, and links reflecting anything else. If many 'collaboration' links were subsequently discovered then it would make sense to subdivide this category in a meaningful way, in order to get a more detailed impression of the types of collaboration represented. In this example it would probably not be useful to subdivide the non-collaboration links because they are not interesting for the research question.

Pilot studies
• Are there enough links to give useful results in a full-scale investigation?
• Test at least 40 random link counts.
• Are the types of links appropriate to address the research question?
• Design a relevant classification scheme and classify at least 40 random links.

FULL-SCALE RANDOM SAMPLING

A classification of at least 160 randomly chosen links from the full data set is needed to support the validity of the results reported. As mentioned above, 160 would be sufficient to ensure that if at least 50% of the pages contain relevant information then there is an 80% chance that in the full sample at least 40% will contain relevant information. See the next section for details about how to choose the size of a random sample if there is a need for accurate category sizes, or to test specific hypotheses about the proportion of links that fit certain categories. The full-scale classification will be similar in design to the pilot study classification, but will need to use methods that are scientifically sound and defensible in order to persuade skeptical readers.

First, the sampling of links must be as random as possible. Ideally, this would involve a random number generator and a procedure for ensuring that all links had an equal chance of being selected. The simplest procedure is to assign each link a number and then use the random number generator to select from these numbers. In some cases the full data set will not be available, for example if using a search engine and the number of link pages exceeds the maximum number that the search engine will display. In situations such as this there may be no alternative to taking a random sample from the links that can be found, but this must then be admitted to be a potential source of bias in the reporting of the results.
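In practice the random selection step is a few lines of code. The sketch below assumes the links have already been gathered into a list (the link data here is invented purely for illustration) and draws a simple random sample in which every link has an equal chance of selection.

```python
import random

# Hypothetical data: each link is a (source URL, target URL) pair collected for the study.
all_links = [(f"http://site{i}.example.ac.uk/page.html", "http://example.org/")
             for i in range(5000)]

random.seed(42)                          # fixed seed so the same sample can be re-drawn
sample = random.sample(all_links, 160)   # every link has an equal chance of being chosen
print(sample[:3])
```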

The classification scheme itself should ideally be based upon an established methodology, such as content analysis (Neuendorf, 2002). With content analysis, in addition to the actual classification scheme, prescriptive reasons for assigning each category also need to be given. More than one classifier should then classify the data and an inter-indexer consistency test can then be used to estimate the accuracy of the classification process. A problem with this is that most previous studies have found links very difficult to classify, in the sense of not getting high levels of agreement between classifiers, despite extensive training and detailed classification schemes (e.g., Harries, Wilkinson, Price et al., 2004). A practical, but second best, solution is to admit the fallibility of the classification process, but report the results anyway. This reduces the strength of classifications as evidence for validity, but it is still essential to use them. To try to ensure high levels of inter-indexer consistency, the categories should be kept as simple as possible, avoiding categories irrelevant to the research goals.
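Inter-indexer consistency can be summarized with an agreement statistic. The book does not prescribe a particular measure; the sketch below uses Cohen's kappa, one common choice, on invented category labels for the same ten links classified independently by two coders.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical categories assigned to the same ten links by two independent classifiers.
coder_a = ["research", "other", "research", "navigation", "research",
           "other", "research", "navigation", "other", "research"]
coder_b = ["research", "other", "other", "navigation", "research",
           "other", "research", "research", "other", "research"]

kappa = cohen_kappa_score(coder_a, coder_b)   # 1.0 = perfect agreement, 0 = chance level
print(f"Cohen's kappa: {kappa:.2f}")
```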

Two critical issues should be addressed after the classification exercise. First, is a high enough proportion of links relevant to the research question? Second, what do the links represent or signify? Both of these questions are troublesome to address. It is not possible to be specific about the percentage of links that must be relevant to the research goals in order to render a link analysis meaningful. Probably if the 'vast majority' are relevant, say at least 80%, then there should be no problem, but if only a 'tiny minority' are relevant, say less than 20%, then the research is unlikely to give meaningful results. Between these two figures the findings may have some value but should be interpreted cautiously. The degree to which lower percentages of links can still give useful results will depend upon the scale of the study and the degree of non-random behavior in the unwanted link types. For example, if the unwanted links were created at random (which is very unlikely) and the total number of links was very high, then the law of averages would apply and significant patterns could be extracted from even a tiny minority of relevant link types. In practice, however, low or medium percentages of links will at least undermine the validity of the findings, and correlation tests could be used as an additional source of support for assertions of validity. As a final point, small-scale studies may opt to classify all links and only analyze those that fit into an appropriate category, giving 100% type validity. For example, a study of collaboration links between researchers' home pages may exclude all links between home pages that did not indicate collaboration.

When reporting results, particularly in an academic paper, care should be taken to explain the categories used. This is needed because of the high degree of ambiguity in short category descriptions such as 'research-related'. The following are two alternative methods, at least one of which should be used.

• Exemplars Report one or two examples of pages or links that fit each category, explaining why they fit (e.g., Cronin, Snyder, Rosenbaum, Martinson & Callahan, 1998).

• Extended descriptions Report extended descriptions of each category used. If using a content analysis, the full content analysis descriptions can be reported (e.g., Harries, Wilkinson, Price et al., 2004).

Full-scale random sampling
• Design a relevant classification scheme using the results of the pilot study and a literature review.
• Produce a randomized list of all links, or other effective method for choosing links at random.
• Classify at least 160 random links.
• Use content analysis and multiple classifiers for stronger results.
• Report extended category descriptions or exemplars.


CONFIDENCE LIMITS FOR CATEGORIES

Once a classification exercise has been completed, it is logical to ask how accurate the category sizes are. For example, if 30 out of 100 pages were classified as research-related, how likely is it that a larger sample would have found that only a tenth of pages were research-related? This question can be addressed using some statistics (e.g., Neuendorf, 2002, pp. 88-91). There are some standard formulae that can be used to calculate the range of likely values for a percentage based upon the results of a random sample.

First, it is appropriate to introduce some statistical terminology. The random sample for classification is called the sample in statistics, and the full set from which it is drawn is called the population. From a random sample, the proportion of pages (or links) matching a category can be calculated, but what the researcher would really like to know is the proportion of matching pages in the whole population. In statistics, it is common to calculate a range of possible values that has a 95% chance of containing the real proportion for the whole population. This is sometimes called a 95% confidence interval, and the lower and upper limits of the interval are called the confidence limits.

Suppose that c pages fall into a given category out of a random sample of n pages. Then the estimated proportion of pages in the category for the whole population is p = c/n. The following formulae give the lower and upper limits for possible values for the correct proportion for the whole population. There is a 95% chance that the real population proportion lies between these limits.

lower = p - 1.96 √(p(1 - p)/n)    (4.1)

upper = p + 1.96 √(p(1 - p)/n)    (4.2)

Example Suppose that when 160 links are classified, 90 fall into a research-related category and the researcher would like to know whether at least half of all links in the whole population are research-related. The proportion of research-related links in the sample is 90/160 = 0.56. The sample proportion of 0.56 is higher than 0.5, but it could be higher by chance. In formulae 4.1 and 4.2, n = 160 and p = 0.56. Substituting these values (using full calculator accuracy) gives the following results.

lower = 0.56 - 1.96 √(0.56(1 - 0.56)/160) = 0.49

upper = 0.56 + 1.96 √(0.56(1 - 0.56)/160) = 0.64

Thus, we can be 95% sure that the proportion of research-related links in the full data set (population) lies between 0.49 and 0.64. In particular, although it looks likely that the majority of links are research-related, the evidence from the confidence intervals does not allow us to claim statistical support for this conclusion, because 0.49 falls inside the confidence interval and is less than 0.5. In other words, the number of research-related links is not high enough to rule out the possibility that less than half of the links in the population are research-related.
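The same confidence limits can be calculated directly. The short function below is a minimal sketch of formulae 4.1 and 4.2; it reproduces the worked example figures of roughly 0.49 and 0.64.

```python
from math import sqrt

def proportion_confidence_limits(c, n, z=1.96):
    """Lower and upper 95% confidence limits for a population proportion,
    following formulae 4.1 and 4.2 (c matching cases out of a sample of n)."""
    p = c / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lower, upper = proportion_confidence_limits(90, 160)
print(f"{lower:.2f} to {upper:.2f}")   # approximately 0.49 to 0.64
```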

In the above example, the confidence limits are quite widely spread: there is only evidence to show that the proportion of research-related links is somewhere between 0.49 (49%) and 0.64 (64%), a difference of 15%. In order to have a difference smaller than 15%, a larger sample would be needed. If the width of the confidence intervals for categories is important, then a calculation can be used before the classification exercise to determine a suitable sample size. The calculation below is a variant of formula 4.1, and is designed to guarantee having a small enough confidence interval. The variables are as above, except that e stands for the allowable error: the amount by which the limits can be above or below the p value. In other words 2e is the desired maximum width of the confidence interval. A p value of 0.5 is assumed, as a worst-case scenario (for different p values, the same sample size will give a smaller confidence interval). Formula 4.3 is derived from Neuendorf (2002, p. 90, formula d, with zc = 1.96).

Example If confidence intervals of width 4% are desired, then 2e = 0.04 so e = 0.02, and putting these numbers into equation 4.3 (see below) a sample size of about 2,400 is needed to guarantee a 95% confidence interval of width no more than 4%.

n = 1.96² × 0.5 × 0.5 / e² = 0.9604 / e²    (4.3)

n = 0.9604 / 0.02² ≈ 2,400
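A sketch of the same sample-size calculation, with the worst-case proportion of 0.5 built in as a default (the function name is illustrative, not from the book):

```python
def sample_size_for_error(e, z=1.96, p=0.5):
    """Sample size needed so the confidence limits lie within +/- e of the
    sample proportion (formula 4.3, assuming the worst case p = 0.5)."""
    return (z ** 2) * p * (1 - p) / (e ** 2)

print(sample_size_for_error(0.02))   # about 2,401, i.e. roughly 2,400
```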

It is important to use statistical approaches, as above, with care and not to overstate the need to analyze large samples in order to get narrow confidence intervals. The web is changing every day and so a high level of precision is not appropriate for most categorisation exercises. If the results of a large-scale classification exercise were published in a journal article then the web would have moved on from the date of the classifications and so the results would need to be interpreted cautiously with respect to the current state of the web. A consequence of this is that categorisation exercises give results that are typically of an exploratory, rather than confirmatory nature. Researchers should not report their results as being correct and definitive about the nature of the web, but only as being estimates at a given point in time. In other words, web research can give insights into the way the web is used but cannot give definitive conclusions because of the dynamic nature of the web.

CORRELATION TESTING

Correlation testing is assessing the correlation between link count data and another data source of known value and meaning. This other data source should be related to the research goals. For example, in a study of universities, an established measure of research productivity may be used, whereas in a study of IT companies, annual growth rate may be preferred. A significant correlation between link count statistics (e.g., inlinks) and another independent measure is evidence that there is some pattern in the link data, and is suggestive of a connection between the two data types. The following two points are important considerations for interpreting results.

• A statistically significant correlation between two phenomena does not imply that one is the cause of the other: there may be an unrelated factor that influences both (Vaughan, 2001, p. 100).

• Size should be factored out. For example, large organizations will tend to have large web sites, many employees and high revenues. A comparison between measures of any two of these will probably show a significant correlation because of the underlying size factor. Size should therefore be eliminated before testing for any relationship. In fact, this is a special case of the first point, but is common enough to be worth a special mention.

Correlation tests
• Identify a quantitative data source of known validity, relevant to the research question.
• Factor out size, e.g. by dividing the correlate, and link counts, by an appropriate size measure.
• Assess the correlation between the link counts and the new data source, probably using a Spearman test.
• Report results, remembering that correlation does not prove causation.
• Plot a scatter graph and analyze anomalies individually, seeking explanations.

Note that correlation tests can never prove the validity of any data source, because they do not demonstrate causation. Nevertheless, in conjunction with random sample classification, they can provide strong support for the validity of an interpretation of link count statistics. For instance, if it is discovered that university inlink counts strongly associate with research productivity then this does not prove that research attracts links but, in conjunction with classification results, it can provide evidence of a connection between research and links.

A non-parametric correlation coefficient such as Spearman's should be used rather than the Pearson correlation coefficient, because of the power law distribution typical of links (>chapter 5).

Correlation assessments can also be used to identify anomalies in the data set. If there is a strong correlation between links and another data source, then a scatter graph of the two data sets, or statistical techniques such as Mahalanobis distance (Tabachnick & Fidell, 2001, p. 68), may show up web sites that do not fit the trend. These can then be investigated to find out why the sites have attracted more (or fewer) links than expected. Finding the answer may give useful insights into the nature of the connection between the two data sources.
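The following sketch shows a size-normalized Spearman test of the kind recommended above. The university figures are invented solely to illustrate the procedure; a real study would substitute its own inlink counts, a validated correlate and a suitable size measure.

```python
from scipy.stats import spearmanr

# Hypothetical data for five universities: inlink counts, research productivity
# scores and a size measure (e.g. academic staff numbers) used to factor out size.
inlinks  = [12000, 3500, 800, 15000, 600]
research = [950, 300, 90, 1100, 40]
staff    = [2400, 900, 300, 2600, 250]

inlinks_per_staff  = [i / s for i, s in zip(inlinks, staff)]
research_per_staff = [r / s for r, s in zip(research, staff)]

rho, p_value = spearmanr(inlinks_per_staff, research_per_staff)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```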

Page 55: Link Analysis: An Information Science Approach (Library and Information Science) (Library and Information Science)

Interpreting Link Counts 43

Finally, the following quote about quantitative studies in management science is a succinct description of the role that correlation tests can play. This also relates to web indicator building (>chapter 24).

In most problems there are a number of variables to be considered. Our first objective is to find the correlation between these variables, our second to test the stability of the correlation by finding the causal link lying behind it, and our final objective is to estimate the likely consequences of any particular change that may be imposed upon the system. We thus provide ourselves with a quantitative basis for any decision to be made regarding the change.

(Goodeve, 1948, p. 379)

LITERATURE REVIEW

A literature review can be a third useful source of validity evidence. If there are similar published studies, then the results of their correlation and categorization tests can be presented as additional evidence. Relevant issues to consider when assessing the similarity of other investigations include differences in time and country, as well as the similarity of the types of site that have been investigated.

SUMMARY

Pilot studies are essential for link analysis in order to ensure that there will be enough links to make a full-scale study worthwhile, and that the links found will be of the right type to address the research goals. A full-scale classification exercise should then be used to support the validity of conclusions drawn from the data. Choosing the sample size is a difficult issue because web use changes over time and there may be little point in choosing a sample large enough to get very accurate results, because the accuracy will be suspect even after a few months have elapsed. Smaller sizes are more common in the research literature, but these provide only estimates rather than statistically supported conclusions, a fact that should be made clear in the reporting of the outcome. If possible, the link count data should also be correlated with other data sources of known value in order to support the reliability and validity of the interpretation of the results.

FURTHER READING

See Oppenheim (2000) for a discussion of a range of issues that should be taken into account when validating a new data source. Oppenheim's discussion is of patent citations, but these extend naturally to web links. Examples of random sampling approaches can also be consulted, both to evaluate their methodologies and to obtain ideas for link typologies (Bar-Ilan, 2004; Harries, Wilkinson, Price et al., 2004; Wilkinson, Harries, Thelwall, & Price, 2003).


A content analysis textbook, such as Neuendorf (2002), is a useful introduction to the technique, its procedures and measures. It is worth reading even if full-scale content analysis is not used. A statistics book should also be consulted to ensure that correct conclusions are drawn from the correlation tests (e.g., Vaughan, 2001). Random sampling can be helped by a random number table in a statistics book or the random number generator in Microsoft Excel.

REFERENCES

Bar-Ilan, J. (2004). A microscopic link analysis of academic institutions within a country: The case of Israel. Scientometrics, 59(3), 391-403.

Bar-Ilan, J. (2005, to appear). What do we know about links and linking? A framework for studying links in academic environments. Information Processing & Management.

Brin, S., & Page, L. (1998). The anatomy of a large scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

Cronin, B., Snyder, H.W., Rosenbaum, H., Martinson, A., & Callahan, E. (1998). Invoked on the web. Journal of the American Society for Information Science, 49(14), 1319-1328.

Davenport, E. & Cronin, B. (2000). The citation network as a prototype for representing trust in virtual environments. In: B. Cronin & H.B. Atkins (Eds.), The web of knowledge: A festschrift in honor of Eugene Garfield. Medford, NJ: Information Today, pp. 517-534.

Garrido, M. & Halavais, A. (2003). Mapping networks of support for the Zapatista movement: Applying social network analysis to study contemporary social movements. In: M. McCaughey & M. Ayers (Eds.), Cyberactivism: Online activism in theory and practice. New York: Routledge, pp. 165-184.

Goodeve, C. (1948). Operational research. Nature, (13 March), 377-384.

Harries, G., Wilkinson, D., Price, E., Fairclough, R. & Thelwall, M. (2004, to appear). Hyperlinks as a data source for science mapping. Journal of Information Science, 30(5).

Hernandez-Borges, A. A., Macias-Cervi, P., Gaspar-Guardado, M. A., Torres-Alvarez de Arcaya, M. L., Ruiz-Rabaza, A. & Jimenez-Sosa, A. (1999). Can examination of WWW usage statistics and other indirect quality indicators distinguish the relative quality of medical web sites? Journal of Medical Internet Research, 1(1). Available: http://www.jmir.org/1999/1/e1/index.htm

Howell, D.C. (2002). Statistical methods for psychology. Pacific Grove, CA: Duxbury.

Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.

Neuendorf, K. (2002). The content analysis guidebook. London: Sage.

Oppenheim, C. (2000). Do patent citations count? In: Cronin, B. & Atkins, H.B. (Eds.), The web of knowledge: A festschrift in honor of Eugene Garfield. Medford, NJ: Information Today Inc. ASIS Monograph Series, 405-432.

Palmer, J. W., Bailey, J. P., & Faraj, S. (2000). The role of intermediaries in the development of trust on the WWW: The use and prominence of trusted third parties and privacy statements. Journal of Computer-Mediated Communication, 5(3). Retrieved May 6, 2004 from http://www.ascusc.org/jcmc/vol5/issue3/palmer.html

Park, H., Barnett, G. & Nam, I. (2002). Hyperlink-affiliation network structure of top web sites: Examining affiliates with hyperlink in Korea. Journal of the American Society for Information Science and Technology, 53(7), 592-601.

Park, H. & Thelwall, M. (2003). Hyperlink analysis: Between networks and indicators. Journal of Computer-Mediated Communication, 8(4). Available: http://www.ascusc.org/jcmc/vol8/issue4/park.html

Smith, A. G. (1999). A tale of two web spaces: Comparing sites using web impact factors. Journal of Documentation, 55(5), 577-592.

Tabachnick, B. & Fidell, L. (2001). Using multivariate statistics, 4th edition. Needham Heights, MA: Allyn and Bacon.

Thelwall, M. (2003). What is this link doing here? Beginning a fine-grained process of identifying reasons for academic hyperlink creation. Information Research, 8(3), paper no. 151. Available: http://informationr.net/ir/8-3/paper151.html

Vaughan, L. (2001). Statistical methods for the information professional. Medford, New Jersey: Information Today.

Vreeland, R.C. (2000). Law libraries in hyperspace: A citation analysis of World Wide Web sites. Law Library Journal, 92(1), 9-25.

Wilkinson, D., Harries, G., Thelwall, M. & Price, E. (2003). Motivations for academic web site interlinking: Evidence for the web as a novel source of information on informal scholarly communication. Journal of Information Science, 29(1), 59-66.


PART II: WEB STRUCTURE

LINK STRUCTURES IN THE WEB GRAPH

OBJECTIVES

• To model web linking and growth through power laws.
• To show how the web can be grouped into a few topological categories based upon the overall link structure of the web, and how similar categories can be identified in academic webs.

INTRODUCTION

The web can be represented at a very abstract level by discarding the contents of all pages and just considering the links between pages, or between any other type of ADM document. This produces a mathematical object called a directed graph, or digraph. The attraction of this extreme level of abstraction is that its simplicity may allow the discovery of basic laws that apply to the web, laws that the contents of pages might obscure. An additional motivation, particularly for computer scientists, is that crawlers only consider links between pages and not their contents, and so findings about the web as a directed graph may help the design of crawlers or the development of crawling strategies. Web digraph research has attempted to model the growth of the web, to describe its overall structure, and to identify the distribution of links between pages through links alone. Most of this research has been applied to the general web but some has been aimed at academic web spaces.

Before describing any results, some terminology must be introduced. In mathematical graph theory, a directed graph is made up of a collection of objects called vertices or nodes and a collection of connections between nodes, called arcs, arrows or links. The indegree of a node is the number of arcs that point to it, and its outdegree is the number of arcs that originate at the node. The web is a directed graph with pages being nodes and hyperlinks being arcs. Directed graphs occur in many situations and there is a body of standard mathematical findings about them that can be applied to any newly discovered directed graph.

A related kind of graph is normally just called a graph and is identical to a directed graph except that the connections between nodes do not have a direction, and are called edges. The degree of a node in a graph is the number of edges attached to it. Nodes in graphs do not have indegrees or outdegrees. The web is not a graph because links between web pages are directional. For example, if there is a hyperlink from page B to page A then it can be followed from page B to page A but not from A to B. Nevertheless, it can sometimes be useful to convert a directed graph into a graph by losing the direction of the links. Figure 5.1 shows the effect of converting a collection of three web pages from a directed graph into the underlying graph.

Figure 5.1. A directed graph (left) and its underlying graph.

The change in the structure is fundamental because in the directed graph all pages can be reached from B alone, but B cannot be reached from A or from C, while the undirected version is a 'loop' where, starting at any page, both of the other pages can be reached. Despite the important loss of information when moving from a digraph to the underlying graph, the additional abstraction is sometimes helpful.
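A directed graph of this kind is easy to represent in code. The sketch below uses an adjacency list; the three pages and their links are an arrangement consistent with the description of Figure 5.1 (B links to A and to C), not copied from the figure itself.

```python
# Adjacency-list digraph: keys are pages, values are the pages they link to.
digraph = {"A": [], "B": ["A", "C"], "C": []}

outdegree = {page: len(targets) for page, targets in digraph.items()}
indegree = {page: 0 for page in digraph}
for targets in digraph.values():
    for target in targets:
        indegree[target] += 1

# The underlying (undirected) graph keeps each connection but discards its direction.
undirected = {page: set() for page in digraph}
for page, targets in digraph.items():
    for target in targets:
        undirected[page].add(target)
        undirected[target].add(page)

print(indegree)    # A and C each receive one arc from B; B receives none
print(outdegree)   # B has outdegree 2; A and C have outdegree 0
print(undirected)  # every page can now reach the other two via undirected edges
```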

POWER LAWS IN THE WEB

Probably the most important fact about web linking, as abstracted to the digraph (or graph) level, is that the inlink and outlink degrees (or degrees) of nodes follow a power law. This is explained in more detail after some historical context.

The connectivity of graphs has been studied in mathematics for a long time, allowing mathematics to predict properties of large graphs. One of these properties is the length of the shortest path between two random nodes. A path is a collection of edges (or arcs if a digraph) that form an unbroken chain from a start node to an end node. If a large graph is constructed at random then the shortest path between any two nodes should be typically quite long. Similarly, if a graph is put together in a very regular way, for example in a large circle with each node connected to its neighbors, then the shortest path between two random nodes should also typically be very long. These neat mathematical findings were frequently not found in real life examples of graphs, such as the graph of social acquaintances. If a graph is built from all the people of the world, connecting everyone to all of their social acquaintances, then we would expect the average shortest path (chain of acquaintances) between two nodes (people) to be very long, if social acquaintances were formed either completely at random or (more likely) in a highly structured way more similar to the circle form described above. A famous experiment of Milgram (1967) discredited this conjecture. Milgram wondered how long the average chain of acquaintances would be between pairs of random strangers. He selected 160 people from two towns in the USA and gave them a letter, requesting that they forward it to somebody who they thought could help get it to the final recipient, a person in a distant US state. Of the 42 that reached their destination, the average number of intermediaries was a surprisingly short 5.5. This led to a popularization of the notion of 'six degrees of separation': that all pairs of people are separated by an average of 6 acquaintances.

The mismatch between the mathematical theory and reality meant that a new model was needed. It was produced by Watts and Strogatz (1998) with their small world theory. They modeled social networks as highly regular systems, as in the circular arrangement described above, but with the addition of a number of 'shortcuts': connections between nodes that are not near each other in the regular structure. This new mathematical model successfully explained the small world nature of some networks, with very short average shortest paths. Applied to the real world, this model also made sense. People tend to have regular and tightly grouped circles of acquaintances, e.g. work, family and friends. But they may also have a few shortcuts to other circles, for example through a brother-in-law from a different city, or business trips abroad. The small world theory was also found to fit the web well; short average path lengths were found between random pairs of web pages from crawls of sites. For example, for a crawl of 325,729 pages from the nd.edu domain, the average shortest path between any two pages was 11.6 links (Albert, Jeong & Barabasi, 1999). Unfortunately, however, the small world model was wrong in the sense of not being an accurate model of the web, social networks or many other naturally occurring graphs. It has been replaced with the power law theory. There is a big difference between the two theories that has important implications for how to intuitively imagine the networks and for which techniques work for them.

The power law theory of networks implies that the small world phenomena found in structured networks such as the web is not the result of random short cuts but of a few highly connected nodes. For Milgram's social network, the implication for letter senders would be to try to get their letter to an extremely well-connected person, who would probably know how to get it quickly to its target. The implication for finding shortest paths on the web would be to try to follow links to a highly connected site, such as Yahoo!, and then expect to be able to get quickly from Yahoo! to the target page. An unexpected consequence is that power law networks are more vulnerable to 'attacks' because removing the highly connected nodes significantly increases shortest path lengths (Albert, Jeong, & Barabasi, 2000).

The mathematics of power laws precedes their application to networks and is subtler than the above explanation suggests. In particular, there is no binary divide between highly linked nodes and the rest; there is instead a continuum. The power law is also known as Lotka's Law in information science (Lotka, 1926). The version used in this book is the formula f = a/n^b for frequency f, where a and b are constants and n is a variable quantity, such as inlink count or outlink count. For example, if n is the inlink degree of pages in the web, since these follow a power law, values of a and b can be found such that f = a/n^b is the number of pages with n inlinks. Plotted on a graph this gives something like Figure 5.2 below. In fact the data hugs the axes so tightly that it is almost obscured, giving the impression of a binary divide between the vast majority of pages having almost no links and a tiny minority of pages having an enormous number of links. Figure 5.3 is the same as Figure 5.2 but converted to a log-log scale to show more detail. It shows that the binary divide is indeed an oversimplification. Note that there are some isolated points on the right of Figure 5.3, indicating that the power law is not a perfect fit for this particular data set. These are discussed later.

Figures 5.2 and 5.3. A power law with a linear axis scale and a logarithmic axis scale (Australian university web pages).
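On a log-log plot, f = a/n^b appears as an approximately straight line with slope -b, which is how Figure 5.3 reveals the continuum hidden in Figure 5.2. The sketch below builds the frequency distribution from a list of inlink degrees; the degrees shown are invented, and real values would come from a crawl.

```python
import math
from collections import Counter

# Hypothetical inlink degrees for a small set of pages (real data would come from a crawl).
inlink_degrees = [0, 0, 1, 0, 2, 1, 0, 5, 1, 0, 3, 1, 0, 0, 27, 1, 2, 0, 0, 112]

freq = Counter(inlink_degrees)   # f(n): the number of pages with n inlinks
for n in sorted(d for d in freq if d > 0):
    # In log-log coordinates a power law f = a / n**b gives a straight line of slope -b.
    print(n, freq[n], round(math.log10(n), 2), round(math.log10(freq[n]), 2))
```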

MODELS OF WEB GROWTH

A number of researchers have attempted to develop mathematical models of the growth of the web (and other networks) in order to explain the power law phenomena found. These models, typically built by computer scientists or theoretical physicists, have tended to focus on the abstract level of the web graph, but more recent models have attempted to incorporate a limited amount of extra information about the context of pages, as will become clear below. The construction of increasingly complex models of web growth has become an industry, but in this section only simple models that give directly useful results for information science link analysis are considered.

The most important concept underlying the web growth models is that of 'rich get richer' or 'preferential attachment'. In their application to the print form of the learned literature the ubiquitous power laws are often referred to as 'success breeds success' mechanisms, or as the 'Matthew effect'. In other words, a key rule is that new links will tend to attach themselves to pages that already have a lot of links. Without a rule like this, it is difficult to see how the web could grow to a point that there was as large a disparity in link counts between nodes as is found in reality. The original growth model that successfully predicted power laws was derived by Barabasi & Albert (1999). The rule for network growth was simple, but effective. A network could be grown by adding links and pages one by one. Links would not be added at random, however, but would preferentially attach to pages that already had links attached. This was formulated as a probabilistic model so that the probability that a new link attached to a page was proportional to the number of links already attached to the page. This preferential attachment model successfully predicted power laws and seemed to partially reflect the reality of the web: sites with many links to them are more likely to be found by users and presumably are also more likely to be linked to.
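The mechanism can be seen in a few lines of simulation. The sketch below is a deliberately stripped-down caricature of preferential attachment, not the Barabasi & Albert model itself: each new page contributes a single link whose target is chosen with probability proportional to the target's current link count.

```python
import random

random.seed(1)
links = [1, 1]   # toy starting state: two pages, each counted as having one link

for _ in range(10_000):
    # Choose a link target with probability proportional to its current link count.
    target = random.choices(range(len(links)), weights=links)[0]
    links[target] += 1
    links.append(1)          # the newly created page starts with a single link

links.sort()
print("largest link count:", links[-1])               # a few pages accumulate very many links
print("median link count:", links[len(links) // 2])   # most pages keep very few
```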

The basic preferential attachment mechanism had been previously proposed in different guises and in different fields. The preferential attachment model is extremely simplistic and subsequent models have sought to rectify this. For example, geography is not taken into account, nor is the fact that pages are most likely to link to other pages in the same site. Solving these problems is not of interest here, but relating the model to page types is useful, and is discussed next.

Pennock, Flake, Lawrence, Glover, and Giles (2002) assessed the basic power law/preferential attachment model for specific types of pages rather than the whole web. They found that the pure power law distributions of inlink and outlink degrees for large heterogeneous collections of web pages do not apply to some subsets of specific types of page. For instance, the inlink distribution of university home pages is not a power law: there are hardly any pages with few inlinks. A pure power law would have predicted a much larger number. This can be explained by the existence of a second type of linking law, uniform attachment. With uniform attachment, new links are added to the system completely at random and irrespective of the number of links already attached to pages. If this were applied on its own, then all pages would gain approximately the same number of links. The new model of Pennock et al. applied both uniform linking and preferential attachment, but in proportions that varied by page type. For example, links to scientists' home pages followed a power law quite closely, but links to university home pages showed a much stronger uniform attachment tendency, manifested in low numbers of pages with few inlinks.

It seems reasonable to accept that although there are many factors that cause authors to choose where to link to, the number of links received by the target page is an important one. This is not to believe that many authors would actually check link counts for potential link targets, but a cyclical process is likely to be at work. Fundamentally, a page with many links to it is easier to find, either through browsing or through search engine searches (since it will rank higher). Also, the author of a page with many visitors, perhaps because of the high inlink count, could be expected to take care of a page and ensure that it was worth visiting. On a commercial level, visitors translate to profits and so a successful site ought to be maintained and reinforced. The counteracting tendency towards uniform linking to certain types of pages can be explained by factors that are independent of links, perhaps even external to the web. In the case of university home page inlinks, reasons that may be independent of inlink degrees include: academics may link to the home pages of partner institutions; universities are generally well enough known, at least in their own country, for the general public to be able to search for and find their sites; and the creation of exhaustive link lists, containing links to all UK university home pages.


There are two important statistical implications of web power laws. First, power laws are very far from Normal distributions and so parametric statistics are inappropriate on raw link count data. Second, many statistical tests require the data to be independent - in other words the data points do not influence each other. The growth models point to exactly the opposite: link growth appears to be significantly influenced by the existing network structure. Statistical tests can often survive a degree of violation of their hypotheses, such as independence and distribution shape, but the growth models all imply major violations of independence and normality, and at the very least this requires a cautious interpretation of link count statistics.

LINK TOPOLOGIES

The idea of studying the link structure of the whole web originates in the work of scientists attached to the search engine AltaVista, using software designed to analyze the link structure of AltaVista's databases (Broder, Kumar, Maghoul et al., 2000). Although often described as a 'whole web' study, AltaVista's databases were estimated at the time to cover only 15.5% of the publicly indexable (see the glossary for a definition) web (Lawrence & Giles, 1999). Nevertheless, it probably included the more popular pages, reflecting user experiences and needs. The analysis combined directed graph and graph representations of the web to split the data into five parts, as shown in Figure 5.4 and Table 5.1.

• SCC (Strongly Connected Component) is the largest group of pages with the property that from any page all other pages in the set can be reached by following links in their original direction.
• OUT is the collection of pages that are not in SCC but can be reached by following links from SCC.
• IN is the collection of pages that cannot be reached by any page in SCC but can reach pages in SCC by following links.
• TENDRILS is the collection of pages that are not in IN, OUT or SCC but are joined to either IN or OUT by following links between pages either forwards or backwards or in a combination.
• DISCONNECTED is the remainder of the pages. These are not connected in any way to the other four components.

Table 5.1. Sizes of components in a May 1999 AltaVista crawl (Broder, Kumar, Maghoul et al., 2000)

Region   SCC          IN           OUT          TENDRILS     DISC.        Total
Size     56,463,993   43,343,168   43,166,185   43,797,944   16,777,756   203,549,046


Figure 5.4. The topological structure of an AltaVista crawl, 1999 (Broder, Kumar, Maghoul et al., 2000); see also Bjorneborn (2004, p. 79).

The structure described above is universal in the sense that it could be used to describe any directed graph. The importance, therefore, is in the relative size of the five components. DISCONNECTED is the smallest, but at 8% of the total crawl represents a sizable number of pages that are isolated from the main part of the web.
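For a link graph that fits in memory, the same decomposition can be computed directly. The sketch below uses the networkx library on an invented six-page graph; it extracts the largest strongly connected component and then derives IN and OUT from reachability, mirroring the definitions listed above.

```python
import networkx as nx

# Hypothetical miniature web graph: nodes are pages, arcs are hyperlinks.
G = nx.DiGraph([("A", "B"), ("B", "A"), ("B", "C"), ("D", "A"), ("E", "F")])

scc = max(nx.strongly_connected_components(G), key=len)   # largest strongly connected set
seed = next(iter(scc))                                     # any page in the SCC will do
out_part = nx.descendants(G, seed) - scc                   # reachable from SCC, not in it
in_part = nx.ancestors(G, seed) - scc                      # can reach SCC, not in it

print("SCC:", scc, "IN:", in_part, "OUT:", out_part)
```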

The size of SCC can be interpreted as indicating that the web contains a huge core of pages that interconnect through chains of links. This could be expected to contain all the important portal sites, for example, such as Yahoo!. The implication for crawlers - one of the objectives of the Broder et al. study - was that any good starting point for a new crawler would yield about half of the current database: OUT + SCC. More of interest for this book, however, are the parts IN, TENDRILS and DISCONNECTED. How did AltaVista find those pages, if not by following links? There are four other ways that a crawler could theoretically find pages.

• User submission of URLs Web site owners can register their sites with search engines by submitting their URLs. These can then be added to the initial crawl list.
• Memory A search engine can remember URLs found by following links in previous crawls, even if the link source pages have subsequently been deleted or changed.
• Guessing Search engines can guess additional URLs in a limited number of ways. One way is to truncate all URLs found at each slash, in an attempt to find the home pages of each directory. In other words, if the URL http://www9.org/w9cdrom/160/160.html is found, then the crawler might automatically try to fetch http://www9.org/w9cdrom/160/, http://www9.org/w9cdrom/ and http://www9.org/ in addition to the original URL (see the sketch after this list).
• Non-link information AltaVista may possibly not record all URLs found in its connectivity server. For example, when a web page has moved, a request for the original URL may not result in a web page but a direct message from the web server to the web browser telling it the new location. This typically takes the form of a Hypertext Transfer Protocol (HTTP) redirection request. If the redirection information is not added to the web graph then this could change the size of the components, perhaps decreasing the size of IN.
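URL truncation of the kind described in the 'Guessing' item is straightforward to implement. A minimal sketch:

```python
from urllib.parse import urlsplit

def truncation_guesses(url):
    """Guess further URLs by truncating the path at each slash."""
    parts = urlsplit(url)
    segments = parts.path.rstrip("/").split("/")
    return [f"{parts.scheme}://{parts.netloc}" + "/".join(segments[:i]) + "/"
            for i in range(len(segments) - 1, 0, -1)]

print(truncation_guesses("http://www9.org/w9cdrom/160/160.html"))
# ['http://www9.org/w9cdrom/160/', 'http://www9.org/w9cdrom/', 'http://www9.org/']
```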

The AltaVista study is interesting for its overview of web connectivity. Presumably the figures for the relative size of the components would be substantially different if the pages AltaVista could not find were added to it. These should mainly increase the size of IN and DISCONNECTED. Perhaps more fundamentally for this book, however, the study did not report on the types of page found in each component. This information would have been useful to relate the topology to the web as users experience it.

POWER LAWS AND LINK TOPOLOGIES IN ACADEMIC WEBS

Power laws with anomalies typify many different characteristics of academic webs (Thelwall & Wilkinson, 2003). For example, Figure 5.3, of Australian university web page inlink degrees, shows a power law with several isolated points on the right of the graph that do not fit the trend. Power laws are natural to the web, as discussed above, but anomalies are frequent in large collections of academic pages, typically the result of enormous collections of pages from a single subsite. For example, if a collection of 10,000 pages each has a link to four common pages (e.g., home page, help page, credits page, legal disclaimer) then this will result in an anomaly: four pages with an inlink count of 10,000 generated by a single cause.

Figure 5.3 comes from an investigation replicating the AltaVista study to report the topology of collections of university web sites from Australia, New Zealand and the UK (Thelwall & Wilkinson, 2003). The data for each of these three national academic webs was produced by a version of SocSciBot, conducting a separate crawl for each university starting from its home page. As a consequence of finding pages exclusively by following links starting from site home pages, only SCC and OUT components were present in the topological graphs created for each country. Nevertheless, SCC and OUT are complete: because of their definitions, all pages not crawled must come from the other three components. The crawl strategy will have missed many pages on each site without (chains of) inlinks from SCC and OUT. The percentage sizes of OUT and SCC are very similar across the three countries (see Table 5.2), which is suggestive of the existence of an underlying causative factor, and that conclusions about them may be applicable to other countries' academic webs.

Since the sizes of TENDRILS, DISCONNECTED and IN are unknown, the one firm conclusion that can be made is that the relative size of OUT, at more than double the size of SCC in all three countries, is much larger than in the AltaVista study, where it was smaller than the SCC. Assuming that the combined size of TENDRILS, DISCONNECTED and IN is non-trivial, the SCC forms a much smaller percentage of national academic webs than of AltaVista's web coverage. Generalizing from this, it seems that national collections of university web sites have a smaller SCC core than the web. This conclusion refers to academic webs taken in isolation from the rest of the web, ignoring links to and from pages not crawled.

Table 5.2 reports basic statistics from the crawls of Australia, New Zealand and the UK. All figures include only inlinks and outlinks to and from recognized university sites in the same country. The longest shortest directed path statistic is the maximum number of links that must be clicked to get from a random source page to a random target page, when it is possible to make a path. The values are surprising because of their large size: in a power law environment, long shortest paths are not to be expected. In the academic webs they are caused by anomalous behavior, typically the creation by a single author of long chains of web pages that link to each other in a strict sequence. This kind of systematic link structure creation violates the mathematical models of linking that lead to a power law.

Table 5.2. Selected link structure statistics for three national academic webs (Thelwall & Wilkinson, 2003).

Component                              Australia        New Zealand      UK
OUT                                    2,548,276 (73%)  213,078 (70%)    4,557,998 (70%)
SCC                                    963,231 (27%)    92,102 (30%)     1,995,602 (30%)
Total pages                            3,511,507        305,180          6,553,600
Total links                            18,031,706       1,874,141        31,250,705
Maximum page inlinks                   42,903           9,599            27,897
Median page inlinks (all/SCC/OUT*)     1/2/1            1/3/1            1/3/1
Maximum page outlinks                  20,000           3,378            19,999
Median page outlinks (all/SCC*/OUT)    0/8/0            1/7/0            0/6/0
Number of different indegrees          1,465            523              1,686
Number of different outdegrees         947              298              1,461
Longest shortest directed path         362              1,445            1,022

*OUT inlinks include all links from SCC and SCC outlinks include all links to OUT.

SUMMARY

The abstract graph structures underlying the web have led to useful insights about the way in which the web grows and its overall structure. Power laws and preferential attachment are key concepts. When new links are added to the web, a proportion of them preferentially attach to web pages that already have many links. Over time, this results in a huge disparity in links, with a tiny minority attracting an enormous amount. This explains small world phenomena, where most pairs of web pages are connected by relatively short chains of links (if the direction of the links is ignored). This may also help to explain the topology of the web. Presumably most highly interlinked hubs are in SCC, most pages with few links are in IN or OUT, and all pages without links are in DISCONNECTED.

The same laws apply to academic webs but to differing extents. For example, preferential attachment is relatively weak for university home pages and it is likely that the SCC is relatively small. A significant difference in academic webs is the prevalence of anomalies. These show up as points in the wrong place on power law graphs, and very long shortest paths in topological graph investigations. The root cause of both is the systematic creation of large collections of highly structured pages within university web sites.

Finally, power laws make it inadvisable to use parametric statistics to analyze link counts, unless transformed to a normal distribution. Moreover, the preferential attachment model used to explain power law growth undermines assumptions of independence in statistical tests, so conclusions from such tests should be treated with caution.

FURTHER READING

For more on graph theory, try the excellent introductory articles in American Scientist (Hayes, 2000a,b). Good non-mathematical overviews of graph structures of the web, and other similar networks, can be found in the popular books of Barabasi (2002) and Huberman (2001). For the mathematically inclined, an excellent book on web modeling is Baldi, Frasconi & Smyth (2004), and for web growth models and web dynamics there is the edited volume of Levene and Poulovassilis (2004). For state of the art growth models, a search of the arXiv.org preprint server is advised. Mathematicians will also enjoy the details given in the modelling paper of Pennock, Flake, Lawrence, et al. (2002).

Baeza-Yates, Castillo and Saint-Jean (2004) have extended the five topological components model by dissecting some of them mathematically. Lennart Bjorneborn's (2004) thesis is a treasure trove of ideas, pictures, findings and techniques. It was available on the web at the time of writing (http://www.db.dk/lb/phd/). In one section Bjorneborn describes the topology of the domains of the UK academic web after excluding internal site links. The resulting graph reveals all five components, and the types of domain occupying each one were investigated.

REFERENCES

Albert, R., Jeong, H. & Barabasi, A. (1999). Diameter of the world wide web. Nature, 401, 130-131.

Albert, R., Jeong, H. & Barabasi, A. (2000). Error and attack tolerance of complex networks. Nature, 406, 378-382.

Baeza-Yates, R., Castillo, C. & Saint-Jean, F. (2004). Web dynamics, structure and page quality. In: M. Levene & A. Poulovassilis (Eds.), Web dynamics. Berlin: Springer (pp. 93-109).

Baldi, P., Frasconi, P., & Smyth, P. (2004). Modeling the Internet and the web: Probabilistic methods and algorithms. New York: Wiley.

Barabasi, A.L. & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509-512.

Barabasi, A.L. (2002). Linked: The new science of networks. Cambridge, Massachusetts: Perseus Publishing.

Bjorneborn, L. (2004). Small-world link structures across an academic web space: A library and information science approach. PhD Thesis. Royal School of Library and Information Science, Copenhagen, Denmark.

Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. & Wiener, J. (2000). Graph structure in the Web. Computer Networks, 33(1-6), 309-320.

Hayes, B. (2000a). Graph theory in practice: Part I. American Scientist, 88, 9-13. Available: http://www.americanscientist.org/template/AssetDetail/assetid/14708

Hayes, B. (2000b). Graph theory in practice: Part II. American Scientist, 88, 104-109. Available: http://www.americanscientist.org/template/AssetDetail/assetid/14717

Huberman, B.A. (2001). The laws of the Web: Patterns in the ecology of information. Cambridge, Mass.: MIT Press.

Lawrence, S. & Giles, C. L. (1999). Accessibility and distribution of information on the Web. Nature, 400, 107-110.

Levene, M. & Poulovassilis, A. (Eds.) (2004). Web dynamics. Berlin: Springer-Verlag.

Lotka, A. (1926). The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences, 16(12), 317-323.

Milgram, S. (1967). The small-world problem. Psychology Today, 1(1), 60-67.

Pennock, D.M., Flake, G.W., Lawrence, S., Glover, E.J. & Giles, C.L. (2002). Winners don't take all: Characterizing the competition for links on the web. Proceedings of the National Academy of Sciences, 99(8), 5207-5211.

Thelwall, M. & Wilkinson, D. (2003). Graph structure in three national academic webs: Power laws with anomalies. Journal of the American Society for Information Science and Technology, 54(8), 706-712.

Watts, D.J. & Strogatz, S.H. (1998). Collective dynamics of 'small-world' networks. Nature, 393, 440-442.


6

THE CONTENT STRUCTURE OF THE WEB

OBJECTIVES

• To demonstrate the relationship between web links and topics.
• To investigate the relationship between web links and subjects in national systems of university web sites.

INTRODUCTION

This chapter is concerned with the relationship between links and page contents. Here, web page content refers to data extracted from a page's text alone when the extraction is carried out by an algorithm; when a human judgment is made about the contents of a page, the layout and embedded graphics, as well as the text, provide the data defining web page content. General findings about the contents of the source and target pages of links are reviewed, giving useful background information to build an understanding of the role and typical use of links in the web. Some of the research reported here has been conducted for a different purpose - the construction of new web information retrieval algorithms - but provides relevant findings.

It is a common belief that links in web pages are likely to target other web pages with similar contents. There have been several studies, some discussed below, that have operationalized this belief and established that it is indeed true. Computer scientists, in particular, have been concerned with this 'link-content hypothesis' because of the implication that links may be used in algorithms to find pages relating to a common topic. Some commercial search engines and experimental web information retrieval systems use topic-clustering algorithms to group the results of queries by topic. The relationship between content and links has different connotations in academic web spaces. Links, like citations, have the potential to be used to create maps of online relationships within and among subjects. A corollary of this is that in academic web spaces, existing subject and field designations are more important to investigate than the more general topics analyzed in the rest of the web.

This chapter splits into two halves. The first half is concerned with the web in general and the second with the academic web.


THE TOPIC STRUCTURE OF THE WEB

The first important large scale link-content investigation dates from 2002 and was conducted by a group of researchers at the Indian Institute of Technology (IIT) in Bombay, in collaboration with one investigator from the NEC Research Institute, Princeton (Chakrabarti, Joshi, Punera & Pennock, 2002). Its objectives included identifying the probability that pages about one 'broad topic' would link to a different broad topic, and measuring the 'background distribution' of topics on the web. The task of identifying the topic of web pages was devolved to a pre-existing published categorization structure, the Open Directory Project (dmoz or dmoz.org), which was used to train an automatic classifier. An automatic classifier is a computer program that has a set of rules to apply to the contents of a document to guess into which category it should be placed. Training a classifier involves feeding it with a large number of classified documents so that it can formulate the rules that it uses to classify other documents (Chakrabarti, 2003, chapter 5).
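As a concrete illustration of the idea (not the IIT team's actual system), the sketch below trains a tiny topic classifier from pages that already carry category labels, dmoz-style. The example pages, labels and the use of the scikit-learn library are all assumptions made for the illustration.

```python
# A toy illustration of training a topic classifier from labelled pages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: page text paired with a broad topic label.
pages = [
    "lecture notes on quantum mechanics and particle physics",
    "java programming tutorial with example source code",
    "guitar chords and album reviews for indie bands",
    "network security advisories and firewall configuration",
]
topics = ["Science", "Computers", "Arts", "Computers"]

# The classifier learns word-weight rules from the labelled examples...
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(pages, topics)

# ...and can then guess the category of an unseen page.
print(classifier.predict(["open source compiler written in C"]))  # predicted broad topic
```

Real classifiers of this kind are trained on many thousands of directory-listed pages rather than a handful of sentences, but the workflow is the same.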

The IIT team took large samples of 10,000,000 web pages, using them to estimate the distribution of the broad dmoz topics in the web. The sampling policy was a form of crawling called a random link walk. This involves simulating a surfer browsing a large number of pages by selecting links at random from each page viewed. A consequence of the method is that it only includes the better-connected part of the web (OUT + SCC in the terminology of chapter 5). The results are very interesting, nevertheless, because they are designed to reflect what web users actually experience. This is because the sample is biased towards the types of pages that web users would be most likely to visit, i.e. highly inlinked ones. Figure 6.1 shows the broad distribution of topics found.
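A random link walk is simple to express in code. The sketch below runs one over a toy in-memory link graph; in a real study the adjacency dictionary would be replaced by fetching each page and extracting its links, and the graph shown here is invented.

```python
# A sketch of a random link walk over a toy link graph (an adjacency dict stands in
# for fetching live pages and extracting their links).
import random

links = {  # hypothetical graph: page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": ["C"],
}

def random_link_walk(start, steps, seed=0):
    """Simulate a surfer following randomly chosen links, recording pages visited."""
    rng = random.Random(seed)
    page, visited = start, []
    for _ in range(steps):
        visited.append(page)
        outlinks = links.get(page, [])
        if not outlinks:          # dead end: real studies restart or stop here
            break
        page = rng.choice(outlinks)
    return visited

print(random_link_walk("A", steps=10))
```

Because each step follows a link, well-linked pages are visited disproportionately often, which is exactly the bias towards highly inlinked pages described above.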

Figure 6.1. The estimated distribution of topics in the web (Chakrabarti et al., 2002).

A second set of experiments, described in the same paper and using the same methodology to find pages, investigated links to see how often they connected pages classified into different topics. Using the 12 dmoz top-level topics, the results showed that pages within a topic tended to cite other pages within the same topic. Additionally, cross-topic links were not random; topics did not link uniformly to other topics but had preferences. These preferences were not always reciprocated, so if topic A frequently linked to topic B then the converse was not always true. The broad topics 'Computers' and 'Society' were especially popular inter-topic link targets. At a finer level of detail, popular subtopic link targets included the following (the first term is the broad topic).

• Arts/Music
• Arts/Literature
• Arts/Movies
• Computers/Security
• Recreation/Outdoors
• Society/Issues

An example of a high level of linking between a pair of subtopics was Arts/Music to Shopping/Music, which needs no explanation. It would have been interesting from an information science perspective if some analysis had been made of why the subtopics listed above were so popular, but this was not an objective of the article. This is frustrating because replicating the experiments would be impractical for most information science researchers, because of the computing resources needed.

The idea that pages tend to link to other pages covering the same topic was confirmed, using different methods and hypotheses, by Menczer (2005). His experiment demonstrates that, on average, pages are lexically similar to pages that link to them. This is a different approach to that of Chakrabarti et al. (2002) described above, who classified pages into pre-defined topics. Menczer does not classify pages but uses a measure of text similarity for pairs of documents. The average lexical similarity of a pair of pages decreases (with exponential decay) as the number of links separating them increases, as would be expected. This is consistent with longer chains of links leading to more opportunities for topic drift (cf. Bjorneborn, 2004). This research has similar web coverage to that of Chakrabarti et al. (2002) discussed above. It relies on finding pages by links, but starting with the Yahoo! Directory rather than dmoz.
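Lexical similarity can be measured in several ways; one common and simple choice is cosine similarity over word counts, sketched below. Menczer's exact formulation may differ, so this is illustrative only, and the example page texts are invented.

```python
# One simple lexical similarity measure: cosine similarity over bag-of-words vectors.
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between word-count vectors for two page texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

page1 = "link analysis of academic web sites and citation patterns"
page2 = "citation analysis and link counting for academic impact"
page3 = "guitar chords for beginners"

print(cosine_similarity(page1, page2))  # relatively high: overlapping vocabulary
print(cosine_similarity(page1, page3))  # near zero: little shared vocabulary
```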

Summarizing both of the papers reported above: links are clearly related to the semantic content of web pages. Although this can be successfully exploited by topic clustering algorithms and search engines, it is important to give the proviso that this is a statistical phenomenon. Pages do sometimes link to other pages covering completely different topics, but they are more likely to link to similar pages.

A LINK-CONTENT WEB GROWTH MODEL

It is logical to attempt to combine the link-content discoveries with the web growth models of chapter 5 in order to create a web growth model that is sensitive to web content. This should logically give a more realistic simulation of web evolution. One such model has been designed by Menczer (2002), which includes both content and preferential attachment. In Menczer's model, when a new page with links is added to the web, its links tend to attach to other pages based upon a combination of topic similarity and preferential attachment. The preferential attachment rule only applies to pages that are similar enough, i.e. pass a similarity threshold. For pages that are not similar enough to pass the threshold, their (small) chance of being linked to depends only upon their similarity, with very dissimilar pages having very little chance. Evidence was presented to show that the model is reasonable by comparing the statistics generated by it to data from a web crawl.
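The flavour of such a model can be captured in a few lines of simulation. The sketch below is a deliberately crude, one-dimensional version in the spirit of the description above: the topic representation, similarity measure, threshold and other parameters are all invented, and Menczer's actual model differs in detail.

```python
# A toy content-plus-preferential-attachment growth simulation (parameters invented).
import random

rng = random.Random(1)
THRESHOLD = 0.5                                   # hypothetical similarity cut-off
pages = [{"topic": rng.random(), "inlinks": 1}]   # seed page; topic is a point on [0, 1]

def similarity(a, b):
    return 1.0 - abs(a["topic"] - b["topic"])     # crude one-dimensional topic similarity

for _ in range(200):                              # add 200 new pages, one link each
    new = {"topic": rng.random(), "inlinks": 0}
    similar = [p for p in pages if similarity(new, p) >= THRESHOLD]
    if similar:
        # Among sufficiently similar pages, attach preferentially by inlink count.
        target = rng.choices(similar, weights=[p["inlinks"] + 1 for p in similar], k=1)[0]
    else:
        # Otherwise a dissimilar target gets only a small, similarity-weighted chance.
        target = rng.choices(pages, weights=[similarity(new, p) for p in pages], k=1)[0]
    target["inlinks"] += 1
    pages.append(new)

print(sorted((p["inlinks"] for p in pages), reverse=True)[:10])  # most-linked pages
```

Even this toy version tends to produce the expected skew, with a few pages accumulating many inlinks, mostly from topically nearby pages.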


LINK TEXT

Google's founders have from the beginning advocated the use of anchor text to help determine the content of the target document (Brin & Page, 1998). Specifically, their belief is that if a page contains a link then the link text (if any) and the text near the link is likely to succinctly describe the target page. This is useful to users to help them decide whether to follow the link, and to search engines for a second opinion about the target page content. It has now been confirmed by experiment that linking document text can often give a more effective classification of the target document than the target document text (Glover, Tsioutsiouliklis, Lawrence, Pennock & Flake, 2002).
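Harvesting anchor text is straightforward with a standard HTML parser. The sketch below collects (target URL, anchor text) pairs using Python's built-in html.parser module, so that the text of a link can be indexed against its target URL as described above; the example page fragment is invented.

```python
# A sketch of collecting (target URL, anchor text) pairs with the standard library.
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect (target URL, anchor text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.current_href = None
        self.buffer = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.buffer = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            self.anchors.append((self.current_href, "".join(self.buffer).strip()))
            self.current_href = None

html = '<p>See the <a href="http://example.edu/physics/">physics group home page</a>.</p>'
parser = AnchorTextParser()
parser.feed(html)
print(parser.anchors)  # [('http://example.edu/physics/', 'physics group home page')]
```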

THE SUBJECT STRUCTURE OF ACADEMIC WEBS

In university web sites the most interesting link-content questions concern the relationship between academic subjects and links. Do web pages tend to link to others within the same subject? Are there subjects that tend to be linked to by other subjects, or pairs of subjects that interlink? Data to answer these questions is not available for the whole world, but has been derived for a few countries through a series of separate investigations, all using different methodologies.

Figure 6.2. Subject-based high impact university sites in Australia and Taiwan (Thelwall et al., 2003).


The percentage of the highest inlinked subject-based web sites of universities in Taiwan and Australia in 2003 is given in Figure 6.2, broken down by subject (Thelwall, Vaughan, Cothey et al., 2003). The sites classified were found by crawling the university web sites in each country, applying the domain ADM to their link structures, and then selecting the 100 highest inlinked subject-based web sites in each country. The categories used were the UNESCO international subject codes, but modified to reflect the importance of computing, which was classified as a sub-area of mathematics in the original scheme. The prominent position of computing is clear, as are national differences and the low showing of some subjects, including philosophy and ethics. The national variations seem to be due to differences in higher education systems rather than to factors specific to web publishing. For example, Taiwanese computer science was often embedded within electronic technology-oriented departments, such as electronic engineering, whereas Australian computer science tended to be hosted in separate departments.

The most detailed known subject classification that has been applied to academic web sites is that of the UK's national research assessment exercise (RAE), consisting of 68 subject categories. A random sample of 586 pairs of interlinked domain name-based web sites was classified. The results were compared to the number of active researchers in each subject area in order to determine which subjects had large or small web presences for their size. Table 6.1 lists the 10 most over-represented and 10 most under-represented subjects. A full list can be found in the source article (Thelwall, Harries & Wilkinson, 2003). There is evidence of large differences in the extent of web use by the different subjects in the UK, at least in terms of interlinked web sites.

Table 6.1. Relative web presence of interlinked subject web sites in the UK (Thelwall, Harries & Wilkinson, 2003).

Subject                                      Research-active faculty   Domains   Domains per faculty × 1000
Statistics and Operational Research          387                       25        64.7
Computer Science                             1560                      97        62.2
Pure Mathematics                             510                       29        56.9
Linguistics                                  210                       11        52.3
Food Science and Technology                  118                       6         51.1
Applied Mathematics                          734                       36        49.1
Electrical and Electronic Engineering        863                       34        39.4
Library and Information Management           302                       10        33.1
Environmental Sciences                       541                       17        31.4
Physics                                      1668                      36        21.6
Accounting and Finance                       218                       0         0.0
Sports-related Subjects                      319                       0         0.0
History of Art, Architecture and Design      347                       0         0.0
Communication, Cultural and Media Studies    359                       0         0.0
Drama, Dance and Performing Arts             396                       0         0.0
Pharmacy                                     418                       0         0.0
Theology, Divinity and Religious Studies     439                       0         0.0
Philosophy                                   460                       0         0.0
Nursing                                      575                       0         0.0
Law                                          1353                      0         0.0


The top 10 subjects include computing, which is to be expected, but also all three mathematical subject areas and library and information management. Perhaps most revealing is the list of subjects that were not represented at all. In some cases, such as law, this may have been because relevant pages were hosted in larger web sites dominated by other subjects. Some law web pages were probably buried inside business studies web sites, for example. The low number of history pages, at 1.2 domains per research-active faculty (not shown in Table 6.1), is particularly surprising given other research suggesting that historians are rather active web publishers (Nentwich, 2003).

The main purposes of the UK study were to identify whether there were pairs of subjects that tended to interlink, and whether there were common reasons for different subjects to interlink. The main reasons found for interlinking are summarized in Table 6.2. A few pairs of subjects were found that tended to interlink, including physics and computer science. One particular cause of cross-subject links was found to be the use of computing in other subjects. The use of higher education teaching resources, even those designed for one particular subject, also caused many cross-subject links.

Table 6.2. Categories of inter-subject linking in UK web sites (Thelwall, Harries & Wilkinson, 2003).

Category              Number     Description
Similar subject       19 (37%)   The two subjects were perceived to be similar and the link was created because of a subject connection.
General educational   10 (19%)   Primarily related to teaching in higher education.
General resource      6 (12%)    The link related to the use of a non-subject specific general resource on the target web site.
Multidisciplinary     4 (8%)     The two subjects were perceived to be dissimilar but the link was created because of a subject connection.
Computing help        3 (6%)     Primarily related to the use of computing technology in a non-computing subject area, without a research element.
Subject help          2 (4%)     The link target subject was providing information used by the source but without a research element.
Library/museum        2 (4%)     The connections were between library or museum pages.
Recreational          2 (4%)     The link was associated with recreational material.
Other                 4 (8%)     Not classified in any of the above groups.

Another study of the UK used a coarser classification scheme to give a broad view of the distribution of subject areas (Thelwall & Price, 2003). The five categories below were used, again taken from the official UK national research assessment exercise.

• I Medical and Biological Sciences
• II Physical Sciences and Engineering
• III Social Sciences
• IV Area Studies and Languages
• V Humanities and Arts


Classifications were made of the targets of random inter-university links. Standard links (i.e., using the page ADM) were used instead of inter-domain links, in contrast to the previous study. The results (Figure 6.3) are broadly similar to those of the previous two reported studies, with category II, which includes computing and engineering, being clearly dominant.

Figure 6.3. Subject area classification of 312 random link targets (Thelwall & Price, 2003).

The distribution of broad subject areas can also be compared to subject sizes in universities, as measured by the total number of research-active faculty. Comparing Figure 6.4 with Figure 6.3, area II is over-represented on the web, but in the other four areas the number of web sites is approximately proportional to the number of research-active faculty. In fact, the arts and humanities are over-represented as link targets on the web relative to their number of research-active faculty. From Table 6.1, a logical explanation is the inclusion of library and information management within this group.

Figure 6.4. Relative sizes of research in the UK, using a nominal y-axis scale (Thelwall & Price, 2003).

As can be seen from the above reports, academic subject research has taken a different approach from the general web topic studies reported in the first half of this chapter. Table 6.3 summarizes the differences. The classification of sites rather than pages makes it possible for humans to classify a significant proportion of the content of the university web sites in a single country. The subject-classification research found that individual pages in sites were typically very difficult to classify, often containing little or no indication of subject content. In many cases, the words found in the URL of a page are the best indicator of its content. For example, it could be correctly guessed that the subject of the page with URL http://www.colorado.edu/physics/2000/ is physics.
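The URL-word heuristic just mentioned is easy to automate. The sketch below splits a URL path into words and looks them up in a small keyword-to-subject table; the table is invented, and a real classifier would need a far larger vocabulary.

```python
# A sketch of guessing a page's subject from words in its URL (keyword list invented).
import re
from urllib.parse import urlparse

SUBJECT_KEYWORDS = {  # hypothetical mapping from URL words to subjects
    "physics": "Physics",
    "chem": "Chemistry",
    "maths": "Mathematics",
    "math": "Mathematics",
    "history": "History",
    "cs": "Computer Science",
}

def guess_subject(url):
    """Return a subject guess based on words appearing in the URL path, if any."""
    words = re.split(r"[/\-_.~]+", urlparse(url).path.lower())
    for word in words:
        if word in SUBJECT_KEYWORDS:
            return SUBJECT_KEYWORDS[word]
    return None

print(guess_subject("http://www.colorado.edu/physics/2000/"))  # -> Physics
```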

Table 6.3. A comparison of aspects of topic and subject identification studies

                              Topics                                                            Subjects
Area                          Web (OUT + SCC)                                                   University web sites in a single country (OUT + SCC)
Scope                         International                                                     National
Document types                Pages most likely to be visited by a random link walk             Various: random link targets (pages or sites), or linked pages
Classification schema         Existing directory topic structures (dmoz or Yahoo! Directory)    National or international subject classification schemes
Classification method         Automatic                                                         Human
Documents to be classified    Pages                                                             Domain-based web sites or pages

COLINKS

In bibliometrics, indirect connections between documents have sometimes been investigated instead of direct links (citations). The most common is probably the case of two papers being cited in the reference list of a third (co-citation). Motivated by this, and search engine use of a similar idea to find related pages (Arasu, Cho, Garcia-Molina et al., 2001), one paper has investigated whether indirect connections would be stronger indicators of subject similarity in academic webs (Thelwall & Wilkinson, 2004). Recall that two web documents, e.g. pages or sites, are co-linked if a third document links to both, and co-linking if they both link to a third document. In the UK, links and both kinds of colinks were all found to be approximately equivalent in their tendency to join pairs of documents within the same subject. Co-linked and co-linking pages were found to be numerically much more widespread than linked pages, however, making them more useful as evidence for the similarity of two web sites. An unexpected finding was that high colink counts did not give a higher probability of subject similarity.
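Both kinds of indirect connection are simple to compute from an outlink table. The sketch below counts co-linked pairs (both linked to by a common source, analogous to co-citation) and co-linking pairs (both linking to a common target, analogous to bibliographic coupling); the toy link data is invented.

```python
# A sketch of counting colinks from an outlink table, following the definitions above.
from collections import defaultdict
from itertools import combinations

outlinks = {            # document -> set of documents it links to (hypothetical)
    "A": {"C", "D"},
    "B": {"C", "D"},
    "C": {"E"},
    "D": {"E"},
}

co_linked = defaultdict(int)    # pairs pointed to by a common source (like co-citation)
co_linking = defaultdict(int)   # pairs pointing to a common target (like bibliographic coupling)

for source, targets in outlinks.items():
    for pair in combinations(sorted(targets), 2):
        co_linked[pair] += 1

inlinks = defaultdict(set)
for source, targets in outlinks.items():
    for target in targets:
        inlinks[target].add(source)
for target, sources in inlinks.items():
    for pair in combinations(sorted(sources), 2):
        co_linking[pair] += 1

print(dict(co_linked))    # {('C', 'D'): 2}: C and D are co-linked by A and by B
print(dict(co_linking))   # {('A', 'B'): 2, ('C', 'D'): 1}
```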

SUMMARY

The content of the web has a strong computing orientation, with over 40% of pages in the IIT study having been classified to the broad topic of computers. The results (Figure 6.1) also give a good overview of the other broad topics found on the web. Investigations into the relationship between content and links showed that web links tend to connect similar content, whether measured by specific topic categories or intrinsic page text similarity. In national sets of university web sites, some subjects publish on the web more than others, at least in terms of linked pages or domains. Computing again stood out, but not by as much as in the general web, although this may be an artifact of the different classification systems. About a third of inter-university links were found to connect similar subjects, so the link-content relationship found in the general web extends to academic web spaces and subjects. All of these findings are limited to some extent by the web coverage strategies of the experiments that produced them.

In terms of inter-topic links, these are not random but follow identifiable patterns. For example, there are pairs of topics that tend to interlink, and other pairs where one tends to link to the other. In academic webs the equivalent claim is also true: there are pairs of subjects that tend to interlink more than average. There are also subjects, like computing, that tend to be linked to by other subjects, and specific causes of subject drift, including higher education teaching resources.

FURTHER READING

Chakrabarti et al. (2002) give a matrix of estimated probabilities for intra-topic and cross-topic links for the twelve dmoz top-level broad topics, and the paper is well worth reading from start to finish. Bjorneborn's (2004) thesis is also revealing of causes of inter-subject connections and the relationship between subjects and links in academic webs. The papers cited in this chapter all contain additional details not mentioned here.

REFERENCES

Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A. & Raghavan, S. (2001). Searching the Web. ACM Transactions on Internet Technology, 1(1), 2-43.

Bjorneborn, L. (2004). Small-world link structures across an academic web space: a library and information science approach. PhD Thesis. Royal School of Library and Information Science, Copenhagen, Denmark.

Brin, S. & Page, L. (1998). The anatomy of a large scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.

Chakrabarti, S., Joshi, M., Punera, K. & Pennock, D. (2002). The structure of broad topics on the web. Proceedings of the WWW2002 Conference. Available: http://www2002.org/CDROM/refereed/338/

Chakrabarti, S. (2003). Mining the Web. New York: Morgan Kaufmann.

Glover, E., Tsioutsiouliklis, K., Lawrence, S., Pennock, D. & Flake, G. (2002). Using web structure for classifying and describing web pages. WWW 2002. Available: http://www2002.org/CDROM/refereed/504/

Menczer, F. (2002). Growing and navigating the small world web by local content. Proceedings of the National Academy of Sciences, 99(22), 14014-14019.

Menczer, F. (2005, to appear). Lexical and semantic clustering by web links. Journal of the American Society for Information Science and Technology.

Nentwich, M. (2003). Cyberscience: research in the age of the Internet. Vienna: Austrian Academy of Sciences Press.

Thelwall, M., Harries, G. & Wilkinson, D. (2003). Why do web sites from different academic subjects interlink? Journal of Information Science, 29(6), 445-463.

Thelwall, M. & Price, E. (2003). Disciplinary differences in academic web presence - A statistical study of the UK. Libri, 53(4), 242-253.

Thelwall, M., Vaughan, L., Cothey, V., Li, X. & Smith, A. (2003). Which academic subjects have most online impact? A pilot study and a new classification process. Online Information Review, 27(5), 333-343.

Thelwall, M. & Wilkinson, D. (2004). Finding similar academic Web sites with links, bibliometric couplings and colinks. Information Processing & Management, 40(3), 515-526.


III ACADEMIC LINKS

7

UNIVERSITIES: LINK TYPES

OBJECTIVES

• To review the information known about the kinds of pages that are the source or target of links in university web sites.
• To review the information known about the kinds of links found in university web sites, and to explore the implications for interpreting link counts.

INTRODUCTION

Evaluating and interpreting counts of links in academic web spaces forms the core of this book and also informs the analyses of non-academic web spaces. This chapter and the next deal with national systems of university web sites: collections of all university web sites within a single country. Subsequent chapters look at departmental-level linking, international links and journal-related links. This chapter addresses the core issue of why links between university web sites are created. Early link analysis research drew inspiration from bibliometrics, and citation analysis in particular. A brief introduction to this field is therefore useful to set academic link analysis in context.

CITATION ANALYSIS

The fundamental belief underpinning most citation analyses is that, on average, citations to a journal article are indicators of its value, often characterized as 'impact'. The theory behind this belief was popularized by Robert K. Merton's (1973) sociology of science, in which a citation to an article meant that the article's contribution to knowledge had been found useful enough to be incorporated into further research. A high impact, valuable article was one that was highly cited. Similarly, a scientist or journal publishing important work could expect to be highly cited. Conversely, an uncited article was probably irrelevant to the forward progress of science, as was an uncited academic or journal. It follows from this that the value of an article, scientist or journal can be assessed by counting citations. Unfortunately, however, this description of Merton's theory is a considerable oversimplification of the complexity of citation practice. Two examples suffice to illustrate this point. An article can receive few citations despite making a significant contribution to research if it closes a research direction. The contribution of showing that an influential research idea does not work (e.g. cold fusion) might not get highly cited because it blocks future research in the same area. Conversely, many highly cited papers contain useful methods rather than ideas, so they are not cited for their direct contribution to understanding (Borgman & Furner, 2002).

The apparent contradiction between Merton's theory and reality can be resolved by accepting that citation counts are, in general, indicators of research impact, but they are fallible and must be calculated carefully and at an appropriately high level of aggregation (Moed, 2002a; van Raan, 2000). For example, citations cannot be used to compare the impact of individual academics with any degree of reliability. They can, however, be used to compare whole departments within the same discipline, but only if used in conjunction with expert human judgment. The larger scale of analysis at the departmental level gives more chance for the law of averages to operate and even out individual 'failures' of the citation-impact theory.

Ingwersen (1998) directly transferred citation theory to the web when he hypothesized that counts of links to a web site might be used to measure the online impact of that site. For academic web sites, this online impact might also be research impact. If this were true, then link counts could replace or supplement citation counts in research. But before this could be attempted, more needed to be found out about how link counts in various different academic web spaces should be interpreted. This same issue has been addressed for citation counts and answers sought in various ways, including categorization studies, correlation studies, and interviews with authors. The first two of these have transferred to information science link analysis and can be found in the method overview given in chapter 1. Interviews with authors are a logical addition, but are time-consuming and it may well be difficult to get useful results from them. The exception so far seems to be the study of Kim (2000), which is most relevant to journal links and is discussed briefly in chapter 11.

THE ROLE OF A UNIVERSITY WEB SITE

Before looking at types of pages and links it is useful to consider the role of university web sites. Although some academic papers are published on university web servers, there is a lot of other information in university web sites that is not part of the academic publishing process and is not attempting to directly contribute to the progress of academic knowledge. This will be common knowledge to anyone that has used a university web site, but is nevertheless important. Middleton, McConnell, and Davidson (1999) have proposed "a model for the structure and content of a university [web] site", claiming that it is in the interests of a university to provide three different types of information.


• Promotional information: advertising services, assets and achievements to potential customers, collaborators and recruits (recruits being both staff and students).

• Value-added information: providing genuinely useful services to people, encouraging their return and enhancing the institution's reputation as an innovative information provider.

• Utility to staff and students: information, services and resources that will enable an institution to reach its strategic aims more easily, facilitate external and internal communication and enhance education. This may have the additional benefit of impressing potential customers and recruits, demonstrating the facilities which will be available to them, should they choose to come to the institution.

(Middleton, McConnell & Davidson, 1999)

Self-promotion is an important part of the wider process of research, even for individual academics (Hyland, 2003). In other words, departments that do not publicize their work to a more general audience than those that read their published articles are not optimizing their research 'in the round', and may lose out on such things as contacts with industry and new students. The same study also claims that institutions should provide:

...space for scholarly use and learning new ways of exploiting the new medium. Everybody within the institution should have the ability/opportunity to feed into the web site, provided sufficient editorial guidelines are in place. This promotes a vibrant web culture that encourages usage - seen by many as the raison d'etre of the Internet.

(Middleton, McConnell & Davidson, 1999)

Again, this contribution is unlikely to be surprising for many readers, but it does emphasize that experimentation and variety are natural to university web sites.

NATIONAL SYSTEMS OF UNIVERSITY WEB SITES

Much information science link analysis has focussed on the university web sites of a single country. University web sites are a logical object of a long-standing interest in various aspects of scholarly communication (Borgman, 1990). The restriction to a single country is a practical step, to get a data set that is not too large to be collected. Most countries seem to have at least one university and larger, developed nations have an organised system of higher education with perhaps hundreds of individual institutions. Higher educational institutions are often called universities, or a very similar word in different languages, but other terms also used include variations of 'academy', 'polytechnic', 'technical university', 'technical high school', 'institute' and 'college'. Depending upon the country, these names may or may not reflect differences of status, for example concerning the ability to confer higher degrees.

Each national higher education system seems to be unique (e.g., Knudsen, Haug & Kirstein, 1999). Common differences are in the status of different types of vocational or professional education, the funding and organisation of research, and the uniformity of universities. For the organisation of research, an important difference is the extent to which a nation's government-funded research is conducted in universities instead of specialist research organisations (e.g. the Max Planck Institutes in Germany). For uniformity of universities, the Netherlands and the UK are perhaps two extremes, with Dutch universities being very similar in status (although some claim pre-eminence) and UK universities having a wide range of degrees of research-orientation. The US higher education system is unusual on an international scale for the strength of its private sector universities and its relatively unregulated and varied market-driven higher education system (Graham & Diamond, 1997).

For a link analysis of universities, the restriction to a single country has the advantage that the task of gaining specialist knowledge of the higher education system is manageable. Multi-country investigations run the risk of comparing dissimilar institutions with superficially similar names. Moreover, research assessments are needed for correlation testing (<chapter 4, >chapter 8) and national statistics are unlikely to be comparable.

One additional restriction has also been almost universal: the choice to consider only links between different university web sites, i.e. excluding site selflinks (see chapter 3 for more on self-link exclusion). Bar-Ilan's (2004a) case study verifies this difference between site selflinks and inter-site links in (Israeli) university web sites, although finding a significant overlap between the link types found in both contexts. In this chapter, as in the rest of the book, unless explicitly stated all links are inter-university links.

PAGE TYPES

This section reviews research concerned with the types of page that are the source or target of inter-university links. In citation analysis, citations are normally analyzed wholesale, with the implicit model that all citations are of approximately equal value. There are some exceptions. For example, disciplinary differences are recognized as being important, therefore disciplines are normally analyzed separately; review articles are known to attract a disproportionate share of citations and are therefore sometimes excluded or treated differently (Borgman & Furner, 2002). In non-citation research evaluation exercises, publications may be differentiated according to journal quality, such as valuing articles in international journals higher than articles in national journals (Moed, 2002b). Nevertheless, the vast majority of articles would probably be seen as attempting to contribute to the progress of academic knowledge in a direct way. The situation is very different for university web sites. Since many university web sites are not regulated to any significant extent and contain a variety of different information, it is a real challenge to put together a meaningful general description of their contents. Currently, the most systematic attempt to do this is Bar-Ilan's (2004b) link classification study.

Table 7.1. Page types for Israeli inter-university links (Bar-Ilan, 2004b).

Page type                                           Link source pages   Link target pages
List of resources, bookmarks or non-coherent list   50%                 -
Belongs to/describes an entity                      8%                  49%
Belongs to/describes a person                       10%                 29%
Textual resource (not a list)                       23%                 -
Textual resource (could be a list)                  -                   17%
Describes a service                                 1%                  10%
Event                                               8%                  2%
Non-textual resource                                -                   1%

Bar-Ilan's (2004b) investigation includes categories for the type of source and target page of inter-university links in Israel, summarized in Table 7.1, and page intentions, summarized in Table 7.2. This is a seven-faceted scheme, with another facet being page ownership (e.g., individual, entity), and is easily the largest (1,332 pages) and most comprehensive inter-university link classification exercise so far. More detailed descriptions of the elements of the scheme can be found in a related paper (Bar-Ilan, 2005).

Table 7.2. Page intentions for Israeli inter-university links (Bar-Ilan, 2004b).

Page intention                Link source pages
Professional (work-related)   31%
Educational                   23%
Research oriented             19%
Personal                      9%
Administrative                8%
General/informative           4%
Social                        3%
Technical                     2%

Links between UK universities have been analyzed with a three-faceted link source and target categorization scheme (Harries, Wilkinson, Price, et al., 2004). Only links originating in maths, physics and sociology were analysed. The facets are shown in Figure 7.1, Figure 7.2 and Figure 7.3.

Figure 7.1. Page content types for link sources and target pages associated with maths, physics or sociology in UK university web sites (Harries, Wilkinson, Price, et al., 2004).

Figure 7.2. Page genre types for link sources and target pages associated with maths, physics or sociology in UK university web sites (Harries, Wilkinson, Price, et al., 2004).

Figure 7.3. Owner categories for link sources and target pages associated with maths, physics or sociology in UK university web sites (Harries, Wilkinson, Price, et al., 2004).

Tables 7.1 and 7.2 and figures 7.1 to 7.3 give useful information about the types of pages associated with links in web sites. They are not directly comparable: the Bar-Ilan study is genuinely representative of inter-university links (as indexed by FAST/AllTheWeb) whereas the other study is the aggregation of different sources and should be taken as giving only very approximate guidelines. It is important that the pages classified are related to links because there are presumably large numbers of pages in a university that are irrelevant for link analysis purposes. For example, it would be self-defeating for an online prospectus to frequently link to other universities.


From Table 7.1 the importance of link lists is very clear and this is an immediate difference from citations. There are also a significant number of pages that belong to or describe a person or entity, another difference. Since only 23% of link source pages and 17% of link target pages could be described as a "textual resource" in any way, links typically occur in very different contexts to citations. This is echoed in Figure 7.2, although a comparison with Figure 7.1 shows that these non-text resources can still be related to research. For example, a subject-based link list, although not creating new information, can still provide a useful resource for researchers or students. Also of importance from Figure 7.3, the owners of link source and target pages are typically organizations (departments, research groups or the university itself) rather than individual academics. This is echoed in Bar-Ilan's (2004b) source page ownership facet, which judged only a minority (30%) of link source pages to be published by individual academics.

In summary, it seems fair to describe link source and target pages as containing many link lists and being very entity-centered (persons, research groups, departments, universities). These pages have academic functions, but typically very different functions from, say, journal articles. For example, departmental home pages will presumably give a broad range of information about a department. Departmental home pages do not normally directly communicate research findings, but support the process of research and education in other ways, including with publicity. Some pages may be characterized as instances of informal scholarly communication, for instance subject-based link lists. Textual pages, if describing the function of a research group or department, could also be described as a form of informal scholarly communication, although this communication is not a direct part of producing or collaborating about research.

LINK TYPES

The types of links created between university web sites are important. If all links communicate research, it would be reasonable to borrow Merton's citation theory and claim that link counts measure online research impact. But if all links communicate cookery recipes then the appropriate claim would be for culinary skill. In this section, investigations into types of links in academic web spaces are reviewed in an attempt to decide what link counts could measure. This does not directly address the question of why links are created, but addresses the related question of what identifiable types of link are created.

Early studies of the types of links in national systems of university web sites found immediate problems with using link counts to measure online research impact. Smith (1999), for example, found Australasian university web site inlink counts in 1998 to be sometimes dominated by links to individual resources, and in the UK Midlands a list of curry restaurants was the 13th most popular link target (Thelwall, 2001). A later, larger-scale study of the most highly linked-to pages in all UK universities found different results, however (Thelwall, 2002). There were no recreational pages in the top 100, perhaps reflecting a maturing of the academic web and increasingly serious and professional uses. The top 100 was instead dominated by university home pages. Links to university home pages are difficult to interpret without seeing the link source page because university home pages are normally almost content free. Their main purpose is to serve as a gateway to the rest of the site. In fact, a follow-up study found that many links to university home pages seem to be created without any intention that they be used. For instance, they could be created as part of an exercise in web page design or as an acknowledgement that the target university had helped create the contents of the source page (Thelwall, 2003a). Also in the top 100 inlinked UK university pages were several targets of automatically generated links in web page creation software. One was caused by the LaTeX2HTML program that converts documents from the mathematical typesetting language LaTeX to web pages, inserting at the end of each page a link to the home page of LaTeX2HTML's inventor.

A second experiment was again based upon inter-university links in the UK, but this time individual links were classified rather than the top link targets (Wilkinson, Harries, Thelwall & Price, 2003). A random procedure was chosen to select links from a database containing crawl data for 107 UK universities in July 2001. The links were selected in such a way that approximately the same number were taken from each university's pages. A classification scheme was then developed and 414 of the links independently classified by two researchers based upon the source page, the context of the link in the source page, and the target page. The results are in Table 7.3 for the 294 links about which agreement was reached. Despite the classification scheme having been jointly devised and tested by the researchers, there was a considerable level of disagreement in the results, probably because in many cases the classifications were genuinely difficult to make.

Table 7.3. Common intentions for UK inter-university links (Wilkinson et al., 2003).

Reason for link                  Number
Information for students         25%
Research support and resources   23%
Libraries & e-journals           21%
Recreational                     9%
Page creator or sponsor          7%
Similar department               7%
Research partners                3%
Student learning material        2%
Tourist information              1%
Research reference               1%

The combination of source and target page for a link allowed links to be classified when the target page alone would have given little idea of why it could have been inlinked. For example, one learning technology page linked to a university home page, citing it as an example of another institution undertaking similar research. This link plays the role of acknowledgement, or assigning credit to the target university. It is clearly related to research, although not directly linking to a research description.

Bar-Ilan (2004b) categorized 1,332 Israeli inter-university links for their apparent intentions (Table 7.4), using a different set of categories. Tables 7.3 and 7.4 have comparable data sources, although there are differences in the methodologies for finding the links (the first uses a crawler, the second a commercial search engine). A discussion of these two studies can give important insights into inter-university link types.

Note that Bar-Ilan's 'Professional' category (cf. Bar-Ilan, 2005), when applied to pages, included departmental home pages as well as personal home pages lacking research information. Both of these could have been part of research related links (e.g. acknowledgement links) in the Wilkinson et al. (2003) scheme. The difference is probably due to Bar-Ilan's orientation on link content from an information retrieval perspective, in comparison to the scholarly communication orientation of Wilkinson et al. (2003).

Table 7.4. Common intentions for Israeli inter-university links (Bar-Ilan, 2004b).

Link intention                Links
Professional (work-related)   31%
Research oriented             20%
General/informative           11%
Educational                   9%
Superficial                   8%
Technical                     8%
Social                        5%
Administrative                2%

Although the categories in tables 7.3 and 7.4 are different, some conclusions can be drawn and two extreme possibilities can be discarded. First, links are not typically frivolous in purpose: recreational pages form only 9% in the UK, and social and superficial links in Israel total 13%. Second, links are not typically citations: these account for 1% in the UK, and are therefore presumably a small part of the 'research oriented' category in Israel. Between these two extremes, the majority of pages (probably at least 86% in both countries) play some role in research or education, but fall short of formal scholarly communication, i.e. citations. A count of inter-university links therefore represents a wide range of research and educational activities. The exact proportion of links that are directly research-related, rather than educational, is difficult to assess from tables 7.3 and 7.4. From Table 7.4, only 20% are directly research-related, but in Table 7.3, a combined total of 27% is in the categories research support and resources, research partners, and research reference. The classifiers for Table 7.3, by analyzing the links that had produced different categorizations, agreed that about 90% of links were related in some way to scholarly activity, including education. Of course, there is no clear divide between research and education; the two are interrelated, although more closely in some fields than others.

The central difficulty with interpreting link types is the wide range of different contexts in which links can be created. This is an area of link research that has probably been under-theorized. Before summing up the discussion, Table 7.5 is presented to illustrate a range of linking contexts. Although the examples given are real, the table offers an intuitively chosen selection to help the reader put this discussion into a more concrete perspective.

SUMMARY

Links between universities are neither irrelevant (although a minority are created for recreational purposes) nor equivalent to citations (although about 1% are). Links are part of research and education but not a core part of the communication of scientific advances in the sense of Merton (1973). This means that link counts should not be used to assess research impact. Nevertheless, the web is important for publicizing and promoting research entities, including individuals, research groups, departments, and universities. It is also important for providing teaching material and pointing students to a wide range of information relevant to their studies. Link counts, then, represent a wide range of research and educational activities.


Table 7.5. Some linking contexts (context: possible targeting reasons).

Research-based link lists: Target information quality and relevance/usefulness for research.

Teaching link list: Target information quality and relevance/usefulness for teaching.

General link list: Target information usefulness.

Acknowledgement sections in web pages, e.g. collaborative project web pages: Significant relationship between the source page owner and a person or group in the organization owning the target page (e.g. linking to the institutional home page of a project partner). See Cronin, Shaw and La Barre (2003) for an academic acknowledgement typology.

Link in academic journal/conference paper: Typical range of citation motivations (Borgman & Furner, 2002) as well as online additions (Kim, 2000) (>chapter 11).

Personal pages: Collaborator's personal page, previous employers, undergraduate college.

Teaching page: Location of resource with relevant and well-explained information for a specific topic.

Computing help page: Online manual or help for computing.

If university inlink counts assess anything, then the results presented here suggest that they could measure the extent to which a university is effectively and visibly publishing a wide range of research and education-related material online. If university inlink counts are an indicator (>chapter 24), then they could perhaps be called an indicator of university web publishing health. This description still does not reflect that many links are created as an acknowledgement, completely independent of target page contents. Such links are not indicative of publishing health, but of research health. The next chapter continues the discussion about interpreting link counts.

FURTHER READING

The study of Bar-Ilan (2004b) should be read in full for additional information about the classification exercise, particularly the different facets. The related paper Bar-Ilan (2005) should be consulted in conjunction with this for more information about the definitions used in the classification scheme. That of Harries, Wilkinson, Price et al. (2004) is consulted again concerning the issue of departmental linking (>chapter 10), and the article itself contains descriptions of the categories reported in the tables.

Readers may wish to compare the results of this chapter with some citation context and motivation studies (Chubin & Moitra, 1975; Oppenheim & Renn, 1978) and the review of Borgman and Furner (2002).

REFERENCES

Bar-Ilan, J. (2004a). Self-linking and self-linked rates of academic institutions on the web. Scientometrics, 59(1), 29-41.

Bar-Ilan, J. (2004b). A microscopic link analysis of universities within a country - the case of Israel. Scientometrics, 59(3), 391-403.

Bar-Ilan, J. (2005, to appear). What do we know about links and linking? A framework for studying links in academic environments. Information Processing & Management.

Borgman, C. & Furner, J. (2002). Scholarly communication and bibliometrics. In: Cronin, B. (Ed.), Annual Review of Information Science and Technology 36, Medford, NJ: Information Today Inc., pp. 3-72.

Borgman, C. (1990). Scholarly communication and bibliometrics. California: Sage.

Chubin, D. & Moitra, S. (1975). Content analysis of references: adjunct or alternative to citation counting? Social Studies of Science, 5, 423-441.

Cronin, B., Shaw, D. & La Barre, K. (2003). A cast of thousands: Coauthorship and subauthorship collaboration in the 20th century as manifested in the scholarly journal literature of psychology and philosophy. Journal of the American Society for Information Science, 54(9), 855-871.

Graham, H.D. & Diamond, N. (1997). The rise of the American research universities. Baltimore, MD: The Johns Hopkins University Press.

Harries, G., Wilkinson, D., Price, E., Fairclough, R. & Thelwall, M. (2004, to appear). Hyperlinks as a data source for science mapping. Journal of Information Science, 30(5).

Hyland, K. (2003). Self-citation and self-reference: credibility and promotion in academic publication. Journal of the American Society for Information Science, 54(3), 251-259.

Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.

Kim, H.J. (2000). Motivations for hyperlinking in scholarly electronic articles: A qualitative study. Journal of the American Society for Information Science, 51(10), 887-899.

Knudsen, I., Haug, G. & Kirstein, J. (1999). Trends in learning structures in Higher Education. Available: http://www.bologna-berlin2003.de/pdf/trend_I.pdf

Merton, R. (1973). The sociology of science. Theoretical and empirical investigations. Chicago: University of Chicago Press.

Middleton, I., McConnell, M. & Davidson, G. (1999). Presenting a model for the structure and content of a university World Wide Web site. Journal of Information Science, 25(3), 219-227. Available: http://www.abdn.ac.uk/~com134/publications/jis1999.shtml

Moed, H.F. (2002a). The impact-factors debate: The ISI's uses and limits. Nature, 415, 731-732.

Moed, H.F. (2002b). Measuring China's research performance using the Science Citation Index. Scientometrics, 53(3), 281-296.

Oppenheim, C. & Renn, S. (1978). Highly cited old papers and the reasons why they continue to be cited. Journal of the American Society for Information Science and Technology, 29(5), 225-231.

Smith, A.G. (1999). A tale of two Web spaces: Comparing sites using web impact factors. Journal of Documentation, 55(5), 577-592.

Thelwall, M., Vaughan, L. & Bjorneborn, L. (2005, to appear). Webometrics. In: Annual Review of Information Science and Technology 39.

Thelwall, M. (2001). Results from a Web Impact Factor crawler. Journal of Documentation, 57(2), 177-191.

Thelwall, M. (2002). The top 100 linked pages on UK university web sites: high inlink counts are not usually directly associated with quality scholarly content. Journal of Information Science, 28(6), 485-493.

Thelwall, M. (2003a). What is this link doing here? Beginning a fine-grained process of identifying reasons for academic hyperlink creation. Information Research, 8(3), paper no. 151. Available: http://informationr.net/ir/8-3/paper151.html

Thelwall, M. (2003b). Web use and peer interconnectivity metrics for academic Web sites. Journal of Information Science, 29(1), 11-20.

van Raan, A.F.J. (2000). The Pandora's box of citation analysis: Measuring scientific excellence - the last evil? In: Cronin, B. & Atkins, H.B. (Eds.), The web of knowledge: a festschrift in honor of Eugene Garfield. Medford, NJ: Information Today Inc. ASIS Monograph Series, 301-319.

Wilkinson, D., Harries, G., Thelwall, M. & Price, E. (2003). Motivations for academic Web site interlinking: Evidence for the Web as a novel source of information on informal scholarly communication. Journal of Information Science, 29(1), 59-66.

8

UNIVERSITIES: LINK MODELS

OBJECTIVES

• To review findings about numerical relationships between research and links.
• To describe simple mathematical and logical inter-university linking models.

INTRODUCTION

The link categorization results reported in the previous chapter are inconclusive with respect to the central question of what link counts measure. They certainly do not measure direct knowledge transfer within the core of research: very few links are equivalent to journal citations. Nevertheless, the vast majority of inter-university links seem to relate to scholarly and educational activity, albeit in a wide variety of ways. At a very general level it is reasonable to hypothesize that university web site inlink counts may measure the extent to which a university's scholars are able to engage effectively in web-based academic publication. For this approach, the minority of links created for acknowledgement or for recreational reasons are regarded as having an insignificant influence.

The next stage in the assessment of link counts is to compare them statistically with other metrics of known value. This is standard practice when assessing any new kind of indicator (Oppenheim, 2000). If link counts can be shown to correlate strongly with an established measure, such as one of research performance, then this would be (a) conclusive evidence that links are not created completely at random, and (b) corroborative evidence of a connection between research performance and link counts. Recall that correlation statistics do not give evidence of causation, however (<chapter 4).

THE RELATIONSHIP BETWEEN INLINKS AND RESEARCH

One link count correlation investigation is reported here in detail and other complementary ones are discussed more briefly afterwards. This study compared counts of links to UK universities with a measure of the universities' research productivities (Thelwall & Harries, 2004). The UK was chosen because it has the most extensive research assessment in the world, one that is controversial for its cost, at 1% of the funding allocated, and effort (Adams, 2002). The UK's research statistics are therefore the best available benchmarks for university performance. The figures used are from the 2001 Research Assessment Exercise (RAE), and consist of up to 68 grades per university (www.rae.ac.uk; Mayfield University Consultants, 2004). Grades are awarded by 66 subject-based panels (two pairs of subjects have combined panels) on a seven-point scale, and decided mainly through peer review of the best four publications of each researcher submitted for assessment. These grades can be combined into an overall research productivity rating for each university by linearizing them and then totaling the grades awarded for each active researcher. It is reasonable to characterize the resulting figures as university research productivities, even though they actually conflate research quality and quantity. UK newspapers routinely perform this calculation and then divide the result by the total faculty in a university, obtaining an average research score per faculty member for each university that it is reasonable to describe as average research productivity.
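
The calculation can be illustrated with a short sketch. The grade-to-points mapping and the figures below are illustrative assumptions rather than the exact scheme used in the study; the point is only to show how grades are linearized, weighted by active researchers and then divided by total faculty.

```python
# Illustrative sketch of the research productivity calculation; the mapping and
# example figures are assumptions, not the exact scheme used in the RAE study.
RAE_POINTS = {"1": 1, "2": 2, "3b": 3, "3a": 4, "4": 5, "5": 6, "5*": 7}

def research_productivity(submissions):
    """Sum of linearized grade points weighted by active researchers submitted."""
    return sum(RAE_POINTS[grade] * researchers for grade, researchers in submissions)

def average_research_productivity(submissions, total_faculty):
    """Research productivity per faculty member, as in newspaper league tables."""
    return research_productivity(submissions) / total_faculty

# Hypothetical university with three assessed units.
units = [("5", 40), ("4", 25), ("3a", 10)]
print(round(average_research_productivity(units, total_faculty=90), 2))
```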

Average university research productivity, calculated as above, has a wide range: from 0.4 to 6.6. UK universities exhibit a broad spectrum of research capabilities, from those that have a high international reputation in many subjects, to those that conduct very little research at all. The range in the UK increases the power of statistical tests.

The inter-university link count statistics used in the study were extracted from a link structure database created by a version of SocSciBot crawling 111 UK universities in July 2002 (Thelwall & Harries, 2004). The data were analyzed using two different Alternative Document Models (<chapter 3), the domain ADM and the page ADM. The domain ADM was argued to be the more valid model because of the known link anomalies between UK universities (Thelwall, 2002a). The correlation test performed was of average inlinks per faculty member against average research productivity. The reason for comparing two indicators that have been divided by faculty numbers is to ensure that both are normalized for size. Bigger universities could be expected to conduct more research and attract more inlinks, and so a correlation between total research productivity and total inlinks could be explained through both being related to university size. After normalizing for size, however, another explanation must be sought for any significant correlation found.
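
A minimal sketch of this test is given below, assuming that inlink counts (under the chosen ADM), faculty numbers and average research productivities are already available as per-university arrays; the figures shown are invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Invented per-university figures, in the same order in each array.
inlinks = np.array([5200.0, 950.0, 14100.0, 310.0])   # domain ADM inlink counts
faculty = np.array([1400.0, 600.0, 2300.0, 250.0])    # faculty numbers
research = np.array([3.4, 1.8, 5.6, 0.9])             # average research productivity

# Normalize the link indicator for size before correlating.
inlinks_per_faculty = inlinks / faculty
rho, p = spearmanr(inlinks_per_faculty, research)
print(f"Spearman's rho = {rho:.3f}, p = {p:.4f}")
```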

Figure 8.1 shows the relationship between inlink counts per faculty member and research productivity per faculty member for UK universities. A statistically significant correlation is present (Spearman's rho: 0.784, n=111, significant at the 0.1% level). In conjunction with the link type classification results described above, this suggests, but does not prove, that research activity attracts links. It does not explain the large percentage of education-related, rather than research-related, links, however (<chapter 7).

A limitation of the UK study is its restriction to one country. There is no logical reason why all other countries should follow the UK model, particularly because the USA is the world web leader, and because of language barriers to imitation in any case. Investigations into research-link relationships in other countries are hampered by the lack of authoritative and universal research assessment results, but citation counts, using Institute for Scientific Information (ISI) data, are a logical alternative. Table 8.1 summarizes the results of various national investigations. Research-link relationships have also been discovered in the USA for departments, as discussed later (>chapter 10).

Figure 8.1. Domain inlinks per faculty member against average research productivity for 111 UK universities (Thelwall & Harries, 2004).

Table 8.1. Link-research relationships in different countries.

Country   | Research Indicator                                      | Outcome
Canada    | Science grants or faculty awards                        | Significant correlation with inlinks (Vaughan & Thelwall, 2005)
China     | Composite academic score (NetBig magazine rankings)    | No significant correlation with inlinks (Thelwall & Tang, 2003)
Taiwan    | Citations in the Science Citation Index                | Significant correlation with inlinks (Thelwall & Tang, 2003)
Australia | Government composite academic score (Research Quantum) | Significant correlation with inlinks (Smith & Thelwall, 2002)
UK        | National peer review (RAE)                              | Significant correlation with inlinks (Thelwall & Harries, 2004)

Figure 8.2 shows the relationship between total research productivity and total inlink counts for the UK. The strong linear trend is misleading about the underlying research-link connection because the size factor has not been removed, and size is implicated both in total inlink counts and total research productivity. This relationship is complicated by the trend for universities with higher research ratings to be larger.

Figure 8.2. Domain inlinks against research productivity for 108 UK universities (Thelwall, 2002a).

ACADEMIC LINKING: QUALITY VS. QUANTITY

One logical explanation of the trend found in Figure 8.1 is that the pages of 'better' researchers (i.e. with higher research scores) attract more inlinks. This hypothesis was tested for links within the UK data set, and found to be incorrect (Thelwall & Harries, 2004). The average number of inlinks per page (or per domain using the domain ADM) is approximately constant. This apparent contradiction between higher average research productivity universities attracting more inlinks, and their researchers' pages and domains not attracting more inlinks, can be resolved because better researchers produce more web pages (and domains, see Figure 8.3). They attract more inlinks in total through having more pages (and domains) even though their average inlinks are approximately the same. The linking model in Figure 8.4 summarizes this. This is a logical model in the sense of illustrating the logic of causation. Later in this chapter, mathematical models are also used to represent abstract numerical relationships in the data.

Figure 8.3. Domains per faculty member against average research productivity for 111 UK universities (Thelwall & Harries, 2004).

Figure 8.4. The quantity-based logical linking model.

An important consequence of the linking model in Figure 8.4 is that counts of links to a university do not seem to be directly measuring the quality of its web publishing. It seems that quantity is the key factor. This finding probably contradicts the intuition of many web authors, particularly those that always use information quality to decide where to link, but this does not invalidate the statistical average. It also contradicts many linking assumptions, such as the assumption in PageRank that link counts reflect page popularity (>chapter 12), and that links are not given indiscriminately to pages, but are genuinely useful indicators of target page quality. It is possible, however, that better researchers do tend to produce pages and domains that attract high numbers of inlinks, but that they also tend to produce large numbers of pages that attract few inlinks, giving similar average inlinks to other universities. This possibility is not strongly supported by the data, although it is difficult to definitively discredit the idea because of the difficulty in performing reliable statistical tests on power law data (Thelwall & Harries, 2004).

In summary, and a key finding for academic link analysis: universities with higher research productivity per faculty member attract more inlinks per faculty member. This does not seem to be because more productive universities attract more inlinks per page or domain, but because they publish more pages and domains.

ALTERNATIVE LOGICAL LINKING MODELS

Although the results of the correlation tests above suggest that research attracts links, this does not entirely fit with the results of the link categorization studies (<chapter 7), because the majority of links are not directly related to research; many relate more directly to education. Some of the education-related links may also be research-related because research can feed into education and so there can be a considerable overlap between the two. Figure 8.5 is an alternative causative model of university inlinks. It expresses the possibility that better researchers may not only produce more research-related web pages, but may also produce more educational web pages. This is consistent with both the correlation results and the broad link classification results. But why should better researchers create more educational web pages? There are at least three possible explanations. First, the overlap between research and education may lead to educational pages being created as a spin-off of research. Second, creating web pages is an important activity for publishing research, and so a proportion of researchers probably learn web publishing or at least gain access to others' web publishing skills for use in research. This may make it easier for the same researchers to produce education-related web pages. Third, better researchers may have more time to spend on teaching or spend more of their time on work-related activities. Of course, there are also arguments in the contrary direction: perhaps some better researchers concentrate on research at the expense of their teaching and spend less time on educational web publishing.

Figure 8.5. The web publishing productivity link model.

A weakness of the logical model in Figure 8.5 can be seen from a more detailed investigation of the link typologies in the previous chapter. Some of the links target quite general resources, such as digital libraries, and organizational entities, such as departments, where information professionals such as librarians and web masters may well be the creators of the pages. Better researchers may tend to be in richer institutions that can afford to spend money on the infrastructure to support efficient web publishing. Figure 8.6 extends Figure 8.5 to reflect this possibility.

Figure 8.6 has not been empirically verified. Although no evidence has been presented to show that richer universities tend to better fund web-related infrastructure, the model is consistent with all of the sources presented, and parsimonious in the sense that arguments have been presented against simpler models. Nevertheless, the model presents two separate paths to attract links, but without evidence to suggest the relative importance of the two. An attempt to resolve this issue would need to assess the relative importance of academics and infrastructure in the creation of the inlinked web pages.

MATHEMATICAL MODELS

Linking has, so far, been discussed only in relation to counts of links to universities, and not counts of links from universities. This is because inlinks are more useful as indicators than outlinks, because outlinks are under the control of the site owners (i.e., created by them). An additional technical problem with site outlink counts is that they depend upon a single site crawl, and are therefore more liable to crawler coverage problems than inlink counts, which are totaled from a number of different crawls. For example, if one site is not covered well by a crawler because an important area of the site has pages in a format that cannot be crawled, then this will have a big impact upon the outlink count for that site, but only a small impact on the inlink counts of all other sites, which will lose only the inlinks that were missed from the badly crawled site. A study of outlink counts coined the phrase 'Web Use Factors' (WUFs) for outlinks divided by faculty numbers (Thelwall, 2003b). Web Impact Factors (WIFs) had previously been defined to be inlinks divided by faculty numbers (Ingwersen, 1998; Thelwall, 2001). WUFs were not found to be statistically less reliable than WIFs, which was unexpected. Web Use Factors correlated strongly with average research productivity statistics. More productive universities produce more outlinks, presumably because they produce more pages. This ties in neatly with the inlinking explanation.

The combination of inlink and outlink explanations suggests that the number of links between a pair of universities would be proportional to the product of their research productivities, and this is indeed true. It was established by an experiment that compared different combinations of source and target university average research productivity and size (faculty numbers), finding the best predictor to be the quadruple product of the source and target university average research productivity and faculty numbers (Thelwall, 2002c). This is formula 8.1 below, where L_AB is the number of links from university A to university B, R_A is a measure of the average research quality of university A's faculty, S_A is the number of faculty in university A, and similarly for university B. K_country is a country-dependent constant. In fact K_country will depend both upon the country and the research measure used.

L_AB = K_country R_A S_A R_B S_B    (8.1)

For the UK, the value of K_country was found to be K_UK = 0.000000013 for the standard page ADM (Thelwall, 2002c). Similar models can be defined for the total inlinks to a university, I_A, and total outlinks from a university, O_A, and since inlinking and outlinking are symmetrical in all of the explanations given, the two models are the same and have the same constant, which will be called C_country.

I_A = C_country R_A S_A    (8.2)

O_A = C_country R_A S_A    (8.3)

In all three of these models, following Figure 8.6, the predictor could also be the funding available, i.e. replacing R_A S_A with a single variable for university funding, say F_A. More complex models are also possible, perhaps with separate variables for research pages, institutional pages, education pages and other pages. One Canadian study has separated an education-related variable from a research variable, but did not find a statistically significant education input (Vaughan & Thelwall, 2005). The educational variable used was a measure of student quality (rather than educational quality): the number of national student awards won per 1,000 students. A logical future direction for research is to test different models to assess the balance between research, education, institutional web support and funding. For the greatest statistical power, this would work best in a country with good research and education assessment indicators and where universities often excel at only one of the two.
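
The following sketch shows one way that a constant such as K_country could be estimated from a matrix of observed inter-university link counts, using a simple least-squares ratio; this is an illustration of formula 8.1 rather than the estimation procedure used in the original study.

```python
import numpy as np

def fit_k_country(links, R, S):
    """Estimate K in L_AB = K * R_A * S_A * R_B * S_B by least squares through the origin.

    links: n x n array of observed inter-university link counts (diagonal ignored).
    R, S:  length-n arrays of average research productivity and faculty numbers.
    """
    links, R, S = np.asarray(links, float), np.asarray(R, float), np.asarray(S, float)
    x = np.outer(R * S, R * S)             # predictor R_A*S_A*R_B*S_B for every pair
    mask = ~np.eye(len(R), dtype=bool)     # exclude self-links
    return float((x[mask] * links[mask]).sum() / (x[mask] ** 2).sum())

def expected_links(k, R, S, a, b):
    """Expected links from university a to university b under formula 8.1."""
    return k * R[a] * S[a] * R[b] * S[b]
```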

THE INFLUENCE OF GEOGRAPHY

In the early days of the web, there were many predictions about how cyberspace would be used and some claims about how it had created a space that was divorced from the real world (e.g., Negroponte, 1996). In support of this claim, the technical effort to create a link in a web page is the same, irrespective of whether the inlinked page is in the next town or on the other side of the globe. It is possible to imagine difficulties if the URL is in a different character set or the target page is in a different language, but distance itself is not a factor. This is in contrast to letters, for example, which take longer and are more expensive to send over long distances. Nevertheless, distance does influence academic linking patterns, even for universities within the same country. An analysis of the impact of distance on link creation in the UK showed that neighboring institutions were much more likely to interlink than distant ones (Thelwall, 2002b), as shown in Figure 8.7. Note that the units on the vertical axis are normalized average link counts between pairs of universities at the specified distance apart. They are normalized for the expected link count between the pair, using equation 8.1 above to predict link counts from faculty numbers and average RAE research scores.
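
A sketch of this normalization is shown below: each pair's observed link count is divided by the count expected from formula 8.1, and the ratios are then averaged within distance bands. The band width and the input format are assumptions for illustration.

```python
from collections import defaultdict

def normalized_links_by_distance(pairs, k, band_km=25):
    """Average observed/expected link ratios, grouped into distance bands.

    pairs: iterable of (observed_links, R_a, S_a, R_b, S_b, distance_km) tuples.
    k:     the fitted country constant from formula 8.1.
    """
    bands = defaultdict(list)
    for observed, ra, sa, rb, sb, distance in pairs:
        expected = k * ra * sa * rb * sb
        if expected > 0:
            bands[int(distance // band_km)].append(observed / expected)
    return {band * band_km: sum(r) / len(r) for band, r in sorted(bands.items())}
```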

Figure 8.7. Average minimum link counts between UK universities.

The geographical trend shown in Figure 8.7 can tie in with the link typologies (<chapter 7) and the logical academic linking model (Figure 8.6). Although web authors undoubtedly create some links after worldwide searches for online information about a topic, others are more of a by-product of normal research activities, such as collaboration. Collaboration is affected by distance in some countries, including the UK (Katz, 1994), and so it is logical that this should be reflected in link creation. To give a more concrete example, there are many regional scholarly organizations, such as the North British Functional Analysis Seminar. Affiliated departments link to an official seminar home page hosted by one of them, resulting in a local cluster of links. A second type of example is that some nearby universities share research, teaching and/or library facilities, which can generate additional interlinking.

Geographic factors have also been found in Canada (Vaughan & Thelwall, 2005), with an indication that universities within 3,000 km of each other are much more likely to interlink than more distant universities. This probably reflects an East-West divide in Canadian society, rather than the more gradual change evident for the UK.

REGIONAL GROUPINGS

Figure 8.8 is a graph of the interlinking between UK universities, using raw link counts, and Internet domain name abbreviations for university names; add .ac.uk to the end to get their web site domain name. The diagram is a maximal spanning tree (Chen, 1999), which is a heuristic to produce diagrams with minimal numbers of connecting lines. It allows a more localized analysis of linking than the general model given in Figure 8.7. This model uses raw link counts and has not factored out expected link counts, and so the trends present are due to both geography and research productivity. Nevertheless, some regional groupings highlighted by dotted lines can be seen, and particularly the separation between the four countries comprising the UK.
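
A sketch of the underlying construction is given below, assuming that each university is represented by a profile vector of its links (as in the caption of Figure 8.8); pairwise correlations are used as edge weights and a maximum spanning tree keeps only the strongest connections.

```python
import numpy as np
import networkx as nx

def correlation_spanning_tree(profiles):
    """Maximum spanning tree over pairwise Pearson correlations of link profiles.

    profiles: dict mapping university name -> sequence of link counts (the profile).
    """
    names = list(profiles)
    G = nx.Graph()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = float(np.corrcoef(profiles[a], profiles[b])[0, 1])
            G.add_edge(a, b, weight=r)
    return nx.maximum_spanning_tree(G, weight="weight")

# Hypothetical three-university example.
tree = correlation_spanning_tree({"ox": [10, 3, 7, 2], "cam": [9, 4, 6, 2], "wlv": [1, 0, 2, 5]})
print(sorted(tree.edges(data="weight")))
```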

Figure 8.8. A maximal spanning tree for correlation coefficients between profiles of links to British universities (Thelwall, 2002d).

SUMMARY

Despite the fact that link counts can be predicted for universities based upon their research performance, link counts are not often a direct outcome of research or a direct indicator of research impact or performance. The funding link model, Figure 8.6, has been proposed as an explanation for the numerical relationships found between research and links. Mathematical models of linking have also been described and the existence of regional and geographic linking trends has been illustrated.

REFERENCES

Adams, J. (2002). Research assessment in the UK. Science, 296, 805.

Chen, C. (1999). Visualising semantic spaces and author co-citation networks in digital libraries. Information Processing and Management, 35(3), 401-420.

Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.

Katz, J.S. (1994). Geographical proximity in scientific collaboration. Scientometrics, 31, 31-43.

Mayfield University Consultants (2004). League tables. Times Higher Education Supplement, 1,641 (May 21), 10-15.

Negroponte, N. (1996). Being digital. New York: Vintage.

Oppenheim, C. (2000). Do patent citations count? In: Cronin, B. & Atkins, H.B. (Eds.). The web of knowledge: a festschrift in honor of Eugene Garfield. Medford, NJ: Information Today Inc. ASIS Monograph Series, 405-432.

Smith, A.G. & Thelwall, M. (2002). Web Impact Factors for Australasian universities, Scientometrics, 54(3), 363-380.

Thelwall, M., & Harries, G. (2004). Do better scholars' web publications have significantly higher online impact? Journal of the American Society for Information Science and Technology, 55(2), 149-159.

Thelwall, M. & Tang, R. (2003). Disciplinary and linguistic considerations for academic web linking: An exploratory hyperlink mediated study with Mainland China and Taiwan, Scientometrics, 58(1), 153-179.

Thelwall, M. (2001). Results from a Web Impact Factor crawler. Journal of Documentation, 57(2), 177-191.

Thelwall, M. (2002a). Conceptualizing documentation on the Web: an evaluation of different heuristic-based models for counting links between university Web sites. Journal of the American Society for Information Science and Technology, 53(12), 995-1005.

Thelwall, M. (2002b). Evidence for the existence of geographic trends in university web site interlinking, Journal of Documentation, 58(5), 563-574.

Thelwall, M. (2002c). A research and institutional size based model for national university web site interlinking. Journal of Documentation, 58(6), 683-694.

Thelwall, M. (2002d). An initial exploration of the link relationship between UK university web sites. ASLIB Proceedings, 54(2), 118-126.

Vaughan, L. & Thelwall, M. (2005, to appear). A modeling approach to uncover hyperlink patterns: The case of Canadian universities. Information Processing & Management.

9

UNIVERSITIES: INTERNATIONAL LINKS

OBJECTIVES

• To identify geopolitical and linguistic international linking patterns.
• To give examples of the use of diagrams to illustrate international linking patterns and discuss their possible applications.

INTRODUCTION

All of the studies reported in Table 8.1 used counts of links between universities in the same country. This is in contrast to citation statistics and to research itself, which are both international. The national restriction has been a methodological necessity rather than a virtue. It is a common limitation for studies using a web crawler for data collection because it gives a manageable number of sites to be crawled. Moreover, it is not practical to use web crawlers to find all links to a given university from the rest of the web because this would mean crawling the whole web in order to find the linking pages, a very large and expensive undertaking. The next section of this chapter compares national links with various sources of international links as a (very partial) attempt to address the issue of differences between national and international links. The rest of the chapter is concerned with specifically international linking patterns.

The internationalization of research is an important issue for scientists and policy makers (Braun, Gomez, Mendez, & Schubert, 1992; Georghiou, 1998; Glanzel & Schubert, 2001; Luukkonen, Persson & Silvertsen, 1992). The extent to which a nation's research is recognized internationally and a nation's researchers collaborate with others overseas are often investigated (e.g., de Beaver & Rosen, 1979; Fernandez, Gomez & Sebastian, 1998). For example, "The quality, relevance and international visibility of research have steadily improved" (Science and Technology Policy Council of Finland, 2000). In some cases, internationalization is a specific policy goal; considerable European Union financial resources back the international integration of European research. For instance, EU research funding projects typically require a minimum number of countries to be represented amongst the partners (Europa, 2002). Since the web is a specifically international medium that is extensively used for different aspects of research, it is natural to investigate international inter-university links. It is also logical to address the issue of language, and assess the influence of linguistic differences on international academic linking. The final sections of this chapter review research dealing with these issues.

NATIONAL VS. INTERNATIONAL LINKS

One investigation has used a commercial search engine to count links to UK universities from the 'whole' web (Thelwall, 2002). Different sources of UK university inlinks were compared, using correlation tests as a crude measure of the closeness of any relationship between links and research. At the outset it had been predicted that international inlinks would show a stronger connection with target university research productivity than national inlinks. This was because an international link is presumably an indicator of international visibility, and international visibility is sometimes used as a criterion for judging the value of research. As expected, links from the .edu domain (mainly US universities) correlated very strongly with UK university research productivity, but the same was true for other top level domains, including .com, .net and .org, which will all contain some UK-based pages and some pages from the rest of the world. As can be seen from Table 9.1, there does not seem to be a stronger connection with research from international links than from national links, represented by the UK's .ac.uk academic domain. This domain spans UK universities in addition to closely related organizations and colleges. Unfortunately, Pearson correlations were used rather than Spearman, which would have given more robust results (<chapter 4). The differences between the top four correlations are very small and are not significant. These statistics suggest that international academic links do not behave in a fundamentally different way to national academic links. A link classification exercise would be needed to verify this hypothesis.

Table 9.1. Correlations between inlinks per faculty member and research assessment average for 96 British universities (Thelwall, 2002).

Link source      | Total link pages | Correlation
.ac.uk           | 378,945          | 0.71***
All site inlinks | 2,883,536        | 0.70***
.uk              | 611,811          | 0.69***
.edu             | 367,158          | 0.69***
.org             | 250,796          | 0.64***
.com             | 729,470          | 0.60***
.mil             | 1,771            | 0.53***
.gov             | 27,528           | 0.51***
.int             | 1,530            | 0.47***
.co.uk           | 162,054          | 0.44***
.net             | 159,602          | 0.23*

* = significant at the 5% level, *** = significant at the 0.1% level

INTERNATIONAL LINKING COMPARISONS

The first attempt to compare links on an international scale was Ingwersen's (1998) seminal study, which contrasted inlink counts per page for several Nordic and other countries (including non-university sites). Significant national differences were found, with Norway attracting the most inlinks per page. This was attributed to a national web marketing effort.

Links between universities in the historically related countries of Australia, New Zealand and the UK were analyzed much later using a specialist crawler to crawl the universities in the three countries in order to collect interlinking data (Smith & Thelwall, 2002). These results are essentially subsumed within the paper reported below, but one interesting indicator was introduced, link propensity, defined to be the number of links divided by the total faculty size of the source universities and the total faculty size of the target universities. This is an attempt to factor size out from link calculations in order to avoid disadvantaging small countries for international comparisons.

Patterns of Asia-Pacific university interconnectivity were analyzed in a later, larger-scale investigation, using the advanced search facilities of the commercial search engine AltaVista (now defunct) for link and page counts. Table 9.2 summarizes the results, and Figure 9.1 illustrates raw link count figures, with arrow thickness proportional to link counts from university web sites at the origin of the arrow to university web sites at the target of the arrow. The diagram serves to confirm the importance for online linking of the more developed countries in the region. It points to a degree of regionalization, for example with the high Australia-New Zealand interlinking, but this is perhaps less strong than might have been expected. Essentially, the trend is for web presence size rather than regional location to be the most important factor. Linking trends were also investigated after normalizing for national web publishing quantity (Figure 9.2). Interestingly, the pattern almost reversed after normalization: the best-connected countries were the smaller ones. This suggests that nations that publish few pages actually host and receive a relatively high number of links per page. In terms of the chapter 5 web growth model, this suggests a uniform linking trend influence at the country level, i.e. not all links grow through the 'rich get richer' principle.
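
The two normalizations used in these comparisons can be sketched as follows; the function names and the reading of "divided by the source and target sizes" as a product are illustrative interpretations of the definitions above, not code from the studies.

```python
def link_propensity(link_count, source_faculty, target_faculty):
    """Links between two countries' universities, divided by source and target faculty sizes."""
    return link_count / (source_faculty * target_faculty)

def page_normalized_links(link_count, source_pages, target_pages):
    """Links divided by source and target national academic page counts (as in Figure 9.2)."""
    return link_count / (source_pages * target_pages)

# Example with invented figures: 5,000 links, 20,000 and 8,000 faculty members.
print(link_propensity(5000, 20000, 8000))
```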

Table 9.2. Summary of link count results from AltaVista on 10th January 2002 (Thelwall & Smith, 2002).

Country     | Domain | Total inlinks | Page count | Inlinks per page
Australia   | edu.au | 28,136        | 3,443,953  | 0.008
China       | edu.cn | 3,896         | 296,616    | 0.013
Fiji        | ac.fj  | 2,187         | 582        | 3.758
Hong Kong   | edu.hk | 5,698         | 421,936    | 0.014
Indonesia   | ac.id  | 513           | 36,531     | 0.014
Japan       | ac.jp  | 13,387        | 3,708,113  | 0.004
Korea       | ac.kr  | 4,599         | 1,936,208  | 0.002
Malaysia    | edu.my | 771           | 25,235     | 0.031
New Zealand | ac.nz  | 7,592         | 858,616    | 0.009
Philippines | edu.ph | 765           | 35,697     | 0.021
Polynesia   | upf.pf | 1             | 55         | 0.018
Singapore   | edu.sg | 2,160         | 184,485    | 0.012
Taiwan      | edu.tw | 6,951         | 782,316    | 0.009
Thailand    | ac.th  | 3,896         | 159,833    | 0.024
Vietnam     | edu.vn | 80            | 1,681      | 0.048

Figures 9.1 and 9.2. National academic interconnectivity in the Asia-Pacific region (left) and linking normalized by source and target page counts (right). Official country codes are used (Thelwall & Smith, 2002).

One project has tracked academic links between two geographic regions: Asia and Europe (Park & Thelwall, 2004). All ASEM (Asia-Europe Meeting) countries were selected. The five most highly inlinked university web sites of each of the chosen 25 Asian and European countries were selected and then links between them were counted using AltaVista. European countries with high web publishing dominated the results, particularly the UK, Germany, Holland and Belgium. Again, web publishing size seems to be more important for links than regional affiliation. This may be partly due to the very different languages and scripts used by major Asian countries (Japan, China, South Korea), with English perhaps being a common language for academic communication between them. However, there was some evidence of China becoming a focal point for the region, with Chinese university web sites attracting significant numbers of links. It is interesting that it is China, rather than the more economically advanced Japan, that is playing this role. This is possibly due to the larger number of Chinese speakers and their historical spread throughout the region.

LINGUISTIC INFLUENCES

Smith (1999) pioneered webometric studies of linguistic influences on linking, comparing the web sites of a sample of educational and research organizations in Latin America and Australasia. He found evidence that Latin American pages in English tended to attract more inlinks than those in other languages. Further, he postulated that Latin America risked being bypassed on the web (e.g. attracting relatively few inlinks per page) because of the dominance of English. The search engine AltaVista was used for link and page count data because its language-specific feature could be employed in conjunction with its advanced link counting capabilities in order to deliver language-specific page count and link count figures. Note that although search engines do have international biases, language does not seem to be a significant factor (Vaughan & Thelwall, 2004) and so search engine coverage is not a likely explanation for the linguistic differences found.

Linguistic factors have also been investigated for international interlinking in Europe. AltaVista was used to count links between all the universities of Europe, grouped into nations, and then to count the links again, restricting the results to link source pages in each of the largest European languages in turn. The findings were confirmatory rather than surprising and are summarized below for the 12 largest European Union countries of the time, plus Norway and Switzerland (Thelwall, Tang & Price, 2003).

• All countries except Greece hosted approximately 50% of university web pages in English and approximately 50% in native languages (English for the UK and Eire). In Greece, 90% of pages were in Greek.
• Countries sharing a common language tended to interlink with pages in that language.
• English was extensively used in international linking pages throughout Europe, except Greece.
• Swedish was frequently used for linking within Scandinavia.

Figures 9.3 and 9.4 illustrate the results for two of the languages, German and Spanish. The contrast between the two is clear and explicable. German is spoken in Germany, Austria and Switzerland, and Spanish only in Spain. All arrows represent linking strength between universities after normalizing for university sizes. Note that all arrows shown either originate or terminate in a country that officially speaks the language, although not necessarily exclusively.

An important finding from the European study was the importance of English for international link pages. Linguistic biases in academic webs are not surprising given the dominance of the web by the English-speaking USA, and also the known language bias even in the primary science system for citation analysis (e.g., van Leeuwen, Moed, Tijssen et al., 2000).

Figures 9.3 and 9.4. Normalized linking between universities in German (left) and Spanish (right). Official country codes are used (Thelwall, Tang & Price, 2003).

Another study of European university interlinking has been published, but focusing on different statistical approaches rather than interpretation of the data (Musgrove, Binns, Page-Kennedy & Thelwall, 2003). The results confirm the isolation of Greece within the European web and also point to a tendency towards regionalization, with geographically close countries having similar web linking profiles, even if they do not share a common language.

The specific issue of English as an academic web language has been addressed for Mainland China-Taiwan links. Moed (2002) has claimed that Chinese research of international impact tends to be published in predominantly English language non-Chinese journals. It was hypothesized that this might be reflected on the web through English being used in pages that link between Mainland China and Taiwan, despite the very similar scripts used in both regions. In fact, it was found that English link pages were present but were dominated by pages in Chinese scripts (Thelwall & Tang, 2003). Presumably the distance between the English and Chinese languages is a barrier to publishing in English. There is also a healthy Chinese language research literature, which points to a healthy Chinese language research culture, and so there is not always a need to publish research-related material in English.

Linguistic factors in linking have also been investigated on a national scale for Canada (Vaughan & Thelwall, 2005). French speaking universities were found to attract a much lower number of inlinks than English speaking universities. Whilst English universities tended to link to each other rather than to French universities, French universities did not interlink at a very high level. There seemed to be a much lower general level of web publishing and linking in the French speaking universities than in the English speaking ones. This suggests that for some reason the French language exerts a negative influence on web publishing, perhaps for historical reasons, cultural connections with France (a late web adopter), or perhaps due to English web dominance.

SUMMARY

The findings reviewed support the utility of the diagram approach, which can be used for simple but effective visual indicators of international connectivity. The international patterns of interlinking displayed have not been very surprising; they have probably served mainly to confirm suspicions. Nevertheless, the link analysis approach gives quantitative evidence to support what would otherwise be unsubstantiated claims.

One concrete finding was the relative linguistic isolation of Greece within Europe (Thelwall, Tang & Price, 2003), which is potentially an important issue for the European Union policy makers that funded the research. The significance of English in academic webs, at least in Europe, is a second important finding. The situation in French-Canadian universities is an example that should be of interest to Canadian policy makers. Falling below the web-linking norm for Canada, French-Canadian universities appear to be under-using the web, a serious concern given its importance for many forms of information dissemination. Finally, the linguistic factors evident in international web linking, with countries sharing a common language tending to interlink in that language, highlight a potential problem. It seems that significant information is available on the web that is only being used by those that can read the language in which it is published.

FURTHER READING

The web sites of the European Union funded web indicators projects EICSTES (www.eicstes.org) and WISER (www.wiserweb.org) are good sources of information relating to international linking data, although they have a predominantly European focus. Also worth visiting is www.cybergeography.org, if only for ideas about how to chart international links. Finally, the classic work of Gibbons, Limoges, Nowotny, et al. (1994) is essential background reading for interpreting research and its relationship to global knowledge production.

REFERENCES

Braun, T., Gomez, I., Mendez, A., & Schubert, A. (1992). World flash on basic research: International co-authorship patterns in physics and its subfields, 1981-1985, Scientometrics, 24, 181-200.

de Beaver, D. & Rosen, R. (1979). Studies in scientific collaboration. Part II. Scientific co-authorship, research productivity and visibility in the French elite, Scientometrics, 1, 133-149.

Europa (2002). The sixth framework programme (2002-2006). Available: http://europa.eu.int/comm/research/fp6/index_en.html

Fernandez, M.T., Gomez, I. & Sebastian, J. (1998). La cooperacion cientifica de los paises de America Latina a traves de indicadores bibliometricos. Interciencia, 23(6), 328-337.

Georghiou, L. (1998). Global cooperation in research. Research Policy, 27(4), 611-626.

Gibbons, M., Limoges, C., Nowotny, H., Schwartzman, S., Scott, P. & Trow, M. (1994). The new production of knowledge. London: Sage.

Glanzel, W. & Schubert, A. (2001). Double effort = Double impact? A critical view at international co-authorship in chemistry, Scientometrics, 50(2), 199-214.

Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.

Luukkonen, T., Persson, O. & Silvertsen, G. (1992). Understanding patterns of international scientific collaboration, Science, Technology & Human Values, 17, 101-126.

Moed, H.F. (2002). Measuring China's research performance using the Science Citation Index, Scientometrics, 53(3), 281-296.

Musgrove, P., Binns, R., Page-Kennedy, T., & Thelwall, M. (2003). A method for identifying clusters in sets of interlinking Web spaces. Scientometrics, 58(3), 657-672.

Park, H.W. & Thelwall, M. (2004, submitted). Web science communication in the age of globalization: Links among universities' websites in Asia and Europe. New Media and Society.

Science and Technology Policy Council of Finland (2000). Review 2000: The challenge of knowledge and know-how. Available: http://www.minedu.fi/tiede_ja_teknologianeuvosto/eng/publications/Review_2000.html

Smith, A.G. & Thelwall, M. (2002). Web Impact Factors for Australasian universities, Scientometrics, 54(3), 363-380.

Smith, A.G. (1999). The impact of web sites: A comparison between Australasia and Latin America. In: Proceedings of INFO'99, Congreso Internacional de Informacion, Havana, 4-8 October 1999. Available: http://www.vuw.ac.nz/staff/alastair_smith/publns/austlat/

Thelwall, M. & Smith, A.G. (2002). A study of the interlinking between Asia-Pacific university Web sites. Scientometrics, 55(3), 363-376.

Thelwall, M., Tang, R. & Price, E. (2003). Linguistic patterns of academic web use in Western Europe, Scientometrics, 56(3), 417-432.

Thelwall, M. & Tang, R. (2003). Disciplinary and linguistic considerations for academic web linking: An exploratory hyperlink mediated study with Mainland China and Taiwan, Scientometrics, 58(1), 153-179.

Thelwall, M. (2002). A comparison of sources of links for academic Web Impact Factor calculations, Journal of Documentation, 58(1), 60-72.

van Leeuwen, T., Moed, H. F., Tijssen, R. J. W., Visser, M. S. & van Raan, A. F. J. (2000). First evidence of serious language-bias in the use of citation analysis for the evaluation of national science systems. Research Evaluation, 8(2), 155-156.

Vaughan, L. & Thelwall, M. (2004). Search engine coverage bias: evidence and possible causes. Information Processing & Management, 40(4), 693-707.

Vaughan, L. & Thelwall, M. (2005, to appear). A modeling approach to uncover hyperlink patterns: The case of Canadian universities. Information Processing & Management.

10

DEPARTMENTS AND SUBJECTS

OBJECTIVES

• To discuss the importance of scale upon the validity of link count results.
• To review disciplinary differences in departmental level link count analyses.

INTRODUCTION

University level link analysis, as reported in the previous three chapters, is inherently multidisciplinary because typical universities span a wide range of subjects. In contrast, an analysis of inter-departmental links can be subject-specific, for example by investigating the interlinking of computer science departments within a country. Departmental inlink counts, like university inlink counts, can be used in an evaluative framework to assess whether sites attract enough inlinks, or in a relational framework to find patterns of links between sites. For departmental web sites, links between different subjects can also be investigated. This has a precedent in scientometrics, where it is common to study scholarly communication patterns within and between subjects. The following examples illustrate the major trends.

• In author co-citation analysis (White & Griffith, 1982), the units of study are individual authors and their papers. Relationships between authors can be found by statistical techniques, including those that produce two-dimensional maps of authors and subject relationships.

• Small's (1999) maps of science are pictures of the whole of science drawn by computer software based upon journal articles. Articles are placed close together in the map if there are strong citation-based connections or similarities between them. Small-scale maps show the internal structures of fields and disciplines, whereas large-scale maps illustrate patterns of interdisciplinarity.

• Journal networks use journals as the basic unit of analysis, using inter-journal citations to measure the strength of connections between journals (Leydesdorff, 2004). These are typically used to reveal the internal structure of a single subject, but can also show relevant journals outside of the subject.

The diagrams produced by citation techniques are useful for academics to learn how their field is structured, i.e., which are the coherent sub-fields and which new topics are emerging. An important challenge for link analysis is to derive similar information from the web using links. These diagrams could be timelier because citation analysis is conducted on research that is typically several years old due to the delay between starting research and outcomes being published.

This chapter mainly focuses on inter-departmental links within the same subject. The results are set in context with the previous inter-university linking findings. University-level link analysis serves as a guiding framework for the less-developed area of departmental level link analysis. Issues of scale and interdisciplinarity are important to address when translating between the two levels. The issue of scale is addressed because departmental web sites sit inside university web sites and are, therefore, logically smaller. In some cases departmental web sites are so small that some types of link analysis are not possible. Disciplinary differences in web linking practices are also reviewed.

DEPARTMENTAL WEB SITES

A brief introduction to universities, departments and departmental web sites is given here to set a framework for the subsequent link analysis. Universities are, by definition, multidisciplinary entities, covering a broad range of subjects. Although there are examples of university level institutions that specialize in a narrow range of subjects, these do not seem to be common in any country. Universities are normally split into subject groupings for teaching and research purposes, using names such as faculty, school, department, section, centre, institute. In a typical university these are probably nested. For example, the university may contain several faculties, each containing several schools, containing several departments and so on. Each level of organization will contain a degree of administrative apparatus in addition to the faculty members.

In this chapter the term 'department' is used as shorthand for the organizational unit atthe appropriate size for subject-based investigations, although in some cases 'departments' areactually schools or sections.

University web sites seem to normally reflect the formal structure of the university. Each faculty, school and department may have its own separate web space, probably controlled by its own people. In addition, administrative units such as the personnel department may also have their own pages. There will also be a main university site containing basic information. For any given department, although there will probably be information about it in the general university pages (e.g. in the prospectus), its own web site is likely to be distinct and clearly demarcated from the rest of the university site. The following list suggests some information commonly found in departmental web sites.

• General information about the department, including an overview of teaching and research, geographic location, and a staff list.

• Pages created by individual academics, including personal home pages, teaching pages and information about their research.

• Teaching support pages, such as complete lists of course information.
• Subsites for research groups.

• In some cases, web sites containing extensive resources for research and/or teaching, perhaps created as part of an externally-funded project.

• Administrative pages, such as an online calendar.
• Mirror sites, such as copies of online computer software manuals.

Departmental web sites are normally given their own domain name, at least in the UK and the US. For example, www.csc.lsu.edu and bit.csc.lsu.edu are official domain names for Louisiana State University Department of Computer Science. Departmental subgroups may have a sub-domain (e.g., Software Engineering Laboratory, selab.csc.lsu.edu) or a subdirectory (e.g., Sensor Networking Research Laboratory, bit.csc.lsu.edu/sensor_web/). Other subgroups may have their own separate domain name (e.g., Biological Computation and Visualization Center, bcvc.lsu.edu), perhaps even one that is not formally affiliated to the university, such as a dot com (e.g., www.bioinspired.com is a site of the Intelligent Systems Group, University of York). In contrast, some universities allocate subdirectories of the main domain instead of domain names to departments (e.g., Sheffield University Department of Information Studies, www.shef.ac.uk/is/).

URL naming conventions are important because link analyses use URL matching to identify departmental web pages. The identification of domain names and directories for specific departments is typically a labor-intensive task, with a researcher using search engines, link lists and browsing to attempt to find all of a department's web pages or sites (Li, 2004).
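
A sketch of the URL matching step is given below. The prefixes are illustrative examples based on the departmental sites mentioned above; in practice the list has to be compiled manually for each department.

```python
# Illustrative URL matching for assigning crawled pages to one department.
DEPARTMENT_PREFIXES = (
    "www.csc.lsu.edu/",      # departmental domain
    "bit.csc.lsu.edu/",      # second departmental domain
    "www.shef.ac.uk/is/",    # departmental subdirectory of a main university domain
)

def belongs_to_department(url, prefixes=DEPARTMENT_PREFIXES):
    """True if the URL falls under any of the department's known domains or directories."""
    url = url.lower()
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
    if "/" not in url:
        url += "/"
    return url.startswith(prefixes)

print(belongs_to_department("http://bit.csc.lsu.edu/sensor_web/index.html"))  # True
print(belongs_to_department("http://www.shef.ac.uk/physics/"))                # False
```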

DISCIPLINARY DIFFERENCES IN LINK TYPES

Inter-departmental links are logically a subset of inter-university links. Some types of inter-university links, such as acknowledgement links to university home pages, will clearly not be inter-departmental links. Thus, the overall distribution of link types will be different from that for whole universities. In this section, disciplinary differences in link types are reviewed. The results should be read in context with the subject prevalence results given in chapter 6. In particular, it should be remembered that computing and hard science subjects make extensive use of the web and tend to dominate university web sites. The extent of web publishing in any given subject depends both upon the size of the subject area (i.e. the number of active researchers) and upon the rate of publishing in that area.

One classification exercise has sought both disciplinary differences in linking practice and differences between inter- and intra-disciplinary links (Harries, Wilkinson, Price, Fairclough & Thelwall, 2004), i.e. links between different disciplines (inter-) and links within the same discipline (intra-). See chapter 7 for graphs summarizing the different link type facets used. Significant differences were found between maths, physics and sociology linking practices, and also between intra- and inter-disciplinary linking practices. There were differences in the proportions of each link type for link source and target pages. The differences were in both the content type of linked pages and the genre of the linked pages. No differences were found between the different categories of page owners for link source pages, but disciplinary differences were found in the different types of owner for link target pages. In summary, disciplinary differences are evident in several aspects of linking.

The following examples illustrate some of the differences found between inter-subject and intra-subject links.

• Maths [owner type] Inter-subject links in maths were more likely to target pages owned by outsiders, and intra-subject links in maths were more likely to target research groups or departments.

• Physics [genre type] Inter-subject links were more likely to target academics' home pages, and intra-subject links were more likely to target departmental home pages.

• Sociology [content type] Inter-subject links were more likely to originate in research description pages, and intra-subject links were more likely to originate in subject information pages.

(Harries, Wilkinson, Price, Fairclough & Thelwall, 2004)

The following examples illustrate some of the disciplinary differences found in linking patterns.

• Mathematicians were the opposite of physicists in having the highest tendency to link to fellow subject members' home pages (21%).

• A relatively high number of links from mathematics to non-mathematics pages targeted externally-owned pages. The main causes of these were pages in the uk.arXiv.org eprint archive mirror (xxx.soton.ac.uk, 10 inlinks), a European Mathematical Information Service mirror site (www.maths.soton.ac.uk/emis, 7 inlinks), and the Isaac Newton Institute for Mathematical Sciences (www.newton.cam.ac.uk, 6 inlinks).

• A relatively high proportion of links from physics departments to other physics departments targeted departmental home pages, and a relatively low proportion targeted academics' home pages.

• Sociologists tended to link to other sociologists through the medium of link lists (63%) but did not use link lists much to link outside (26%).

(Harries, Wilkinson, Price, Fairclough & Thelwall, 2004)

No clear pattern was found in the differences or commonalities that could be used to postulate a coherent causal relationship. The different disciplines seem to have just adopted different linking preferences, probably with a degree of imitation between like-minded scholars. The potential for imitation and the influence of individual initiatives, such as popular subject-based web portals or resource sites, undermine any attempt to identify inherent disciplinary linking factors. Kling and McKim (2000) claim that disciplinary differences in the use of electronic media are embedded in the nature of disciplines and fields. It is not a simple matter of light users eventually catching up with heavy users; the differences are more fundamental and this is likely to continue indefinitely. This is a useful issue for future research.

ISSUES OF SCALE AND CORRELATION TESTS

The importance of scale for departmental link analysis is twofold. First, if departments' web sites are too small then no link analysis is possible. For example, an investigation into US history department interlinking was aborted when only three inter-departmental links were found in the whole set (Tang & Thelwall, 2003). Second, even if there is a significant degree of inter-departmental linking, there need to be enough links to give the law of averages a chance to apply, so that any linking patterns found are not primarily the product of a few individual 'random' links.


A simple test for sufficient size to extract meaningful results is a correlation test (<chapter 4). A significant correlation between normalized inlinks and normalized research productivity is evidence of an underlying link creation trend that may allow useful results to be extracted.
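A test of this kind can be run with standard statistical software. The following minimal sketch, in Python, uses invented figures rather than data from any of the studies cited in this chapter, and assumes the SciPy library is available; it simply checks whether normalized inlink counts and normalized research productivity rise and fall together across a set of departments.

    # Minimal sketch of a departmental link-research correlation test.
    # The figures are invented for illustration; they are not data from
    # any of the studies summarized in Table 10.1.
    from scipy.stats import spearmanr

    # Hypothetical per-department figures, each divided by faculty size.
    inlinks_per_faculty = [3.2, 1.1, 4.5, 0.9, 2.8, 5.1, 1.7, 3.9]
    papers_per_faculty = [2.1, 0.8, 3.0, 1.0, 1.9, 3.6, 1.2, 2.5]

    rho, p_value = spearmanr(inlinks_per_faculty, papers_per_faculty)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
    # A significant positive correlation suggests an underlying link creation
    # trend; its absence may mean the sites are too small or too sparsely linked.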

Departments do not all have similar-sized web presences. In addition to individual departmental web publishing activity and policy decisions, two other factors are significant. The first is faculty size: larger departments should logically have larger web sites. The size factor also operates on a national level because larger disciplines should have larger departments and/or more departments. The second important factor is discipline. As shown in chapter 7, some subjects are better represented in university web sites than others. An investigation has confirmed this for the departmental web sites of three disciplines in the US (Tang & Thelwall, 2003). Some history departments did not seem to have a web site at all, however. The median web site sizes found were as follows.

• Chemistry: 1,786 pages per department
• Psychology: 604 pages per department
• History: 113 pages per department

Table 10.1. Link-research correlations for departments.

Country     Subject                            Outcome
UK          Computing                          Significant correlation between inlinks and research (RAE) ratings (Li et al., 2003)
UK          Library and information science    No significant correlation between inlinks and research ratings (Thomas & Willett, 2000)
US          Library and information science    No significant correlation between inlinks and US News rankings (Chu et al., 2002)
US          Chemistry                          Significant correlation between inter-departmental inlinks and ISI citation statistics (Tang & Thelwall, 2003)
UK          Physics/chemistry/biology          Significant correlation between inter-departmental inlinks and research (RAE) ratings (Li, 2004)
Australia   Physics/chemistry/biology          Significant correlation between inter-departmental inlinks and ISI citation statistics (Li, 2004)
Canada      Physics/chemistry/biology          Significant correlation between inter-departmental inlinks and research grants (Li, 2004)
US          Psychology                         Significant correlation between inlinks and inter-departmental inlinks, and ISI citation statistics (Tang & Thelwall, 2003)
US          History                            No significant link trend: very small sites and too few links (Tang & Thelwall, 2003)

Table 10.1 summarizes the findings of departmental correlation studies. The table omits cases of insignificant results when later similar studies found significant results.


It seems that hard sciences use the web enough for significant results to be found but, based upon the size studies discussed above, the same may not be true for other subjects. There are certainly disciplinary differences in the extent to which different subjects make use of the web (<chapter 6) and, given the linking model presented in the previous chapter, this should directly impact upon linking. Subjects that publish fewer web pages should attract and produce fewer links.

Incorporating the chapter 6 results, the trend seems to be for computing departments to publish most, followed by maths and the hard sciences, then the social sciences, and finally arts and the humanities. Of course, there will be individual exceptions to this, such as prolific web-based artists and web-phobic computer scientists.

GEOGRAPHIC AND INTERNATIONAL FACTORS

The examples of geographic trends found in UK and Canadian inter-university linking (<chapter 8) make it natural to enquire whether similar trends occur in departmental level linking. Two departmental link analyses have addressed this issue. A geographic trend was sought in interlinking between US chemistry and between US psychology departments, but none was found (Tang & Thelwall, 2003). It is possible that geography is not important in US research, or that it is less important for disciplines than for entire universities, but this is not yet clear.

A second study investigated the sources of international inlinks to US departments of history, chemistry and psychology (Tang & Thelwall, 2004). There were strong commonalities, such as attracting more links from Europe and Asia than from the rest of the Americas, but there were also disciplinary differences. For example, links to history web sites had the lowest proportion of international links, only 6%. Perhaps the subject of history involves some research of particularly national significance, or perhaps the dominance of books rather than journals in the discipline is reflected in different patterns of web use.

SUMMARY

The statistical results suggest that the logical linking models of chapter 8 should also apply to disciplinary level links, even though web publishing does vary by subject. The differing extent to which subjects use the web is an important factor in whether departmental web sites for any given subject will be large enough to admit a link analysis.

The national and international disciplinary differences in link types discussed have implications for maps of science based upon departmental-level web links. The same number of links will have different broad types depending upon discipline and whether they are inter- or intra-disciplinary. Departmental-level relational analyses therefore need to be interpreted differently for different subjects, and also with respect to inter-disciplinarity and international trends.


FURTHER READING

The articles discussed in the text are all worth reading for their methods and additional results, for those wanting more specific details. The early study of Thomas and Willett (2000) is also valuable for its theoretical perspective and analysis. See Nentwich (2003) for a wide-ranging and discipline-sensitive overview of academic uses of the Internet.

REFERENCES

Chu, H., He, S. & Thelwall, M. (2002). Library and information science schools in Canada and USA: A Webometric perspective. Journal of Education for Library and Information Science, 43(2), 110-125.

Harries, G., Wilkinson, D., Price, E., Fairclough, R. & Thelwall, M. (2004, to appear). Hyperlinks as a data source for science mapping. Journal of Information Science, 30(5).

Kling, R. & McKim, G. (2000). Not just a matter of time: Field differences in the shaping of electronic media in supporting scientific communication. Journal of the American Society for Information Science, 51(14), 1306-1320.

Leydesdorff, L. (2004). Top-down decomposition of the Journal Citation Report of the Social Science Citation Index: Graph- and factor-analytical approaches. Scientometrics, 60(2), 159-180.

Li, X., Thelwall, M., Musgrove, P. & Wilkinson, D. (2003). The relationship between the links/Web Impact Factors of computer science departments in UK and their RAE (Research Assessment Exercise) ranking in 2001. Scientometrics, 57(2), 239-255.

Li, X. (2004, to appear). Disciplinary differences in Web Impact Factors. PhD thesis, University of Wolverhampton.

Nentwich, M. (2003). Cyberscience: Research in the age of the Internet. Vienna: Austrian Academy of Sciences Press.

Small, H. (1999). Visualising science through citation mapping. Journal of the American Society for Information Science, 50(9), 799-812.

Tang, R. & Thelwall, M. (2003). Disciplinary differences in US academic departmental web site interlinking. Library & Information Science Research, 25(4), 437-458.

Tang, R. & Thelwall, M. (2004). Patterns of national and international web inlinks to US academic departments: An analysis of disciplinary variations. Scientometrics, 60(3), 475-485.

Thomas, O. & Willett, P. (2000). Webometric analysis of departments of Librarianship and Information Science. Journal of Information Science, 26(6), 421-428.

White, H.D. & Griffith, B.C. (1982). Author co-citation: A literature measure of intellectual structure. Journal of the American Society for Information Science, 32(3), 163-172.


11

JOURNALS AND ARTICLES

OBJECTIVES

• To explore the relationship between journal article citations and journal web site inlinks.
• To introduce the study of links in the managed environment of a digital library.
• To review other research relating to journals and links.

INTRODUCTION

Journal web sites are a natural subject for information science link analysis because links, like citations, are inter-document connections (<chapter 7; <chapter 1; Harter & Ford, 2000; Ingwersen, 1998; Rodríguez i Gairín, 1997; Smith, 1999). In this chapter, links that originate or terminate in journal articles or journal web sites are the focus of attention. The main thrust of the chapter, in keeping with the theme of the book, is the link analysis of journal web sites. The two sections covering this issue are preceded by a section giving an overview of journal citation metrics and a section introducing journal web site types.

JOURNAL IMPACT FACTORS

The citation-based Journal Impact Factors (JIFs) published by the Institute for Scientific Information (ISI) are a widely-used source of information about the impact of journals (Garfield, 1994). The Impact Factor of a journal is a normalised count of citations to the journal in a given year. Specifically, for a given journal and a year n, the JIF is C/P, where: C is the number of citations to items published in the journal in year n, counting citations in ISI-indexed publications from the two following years, n+1 and n+2; and P is the number of citable items published in the journal in year n. Citable items are normally just articles, excluding letters and editorials. The ISI's JIF is therefore, in most cases, the average number of citations a journal's articles receive in ISI-indexed journals in the two years following their year of publication.
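As an illustration of the arithmetic, the short sketch below restates the C/P definition above in Python and applies it to invented figures; it is not ISI's software, simply a worked example of the formula.

    # Illustrative Journal Impact Factor calculation using the definition above.
    # The numbers are invented; real JIFs are computed by ISI from its own data.
    def impact_factor(citations_in_next_two_years, citable_items_published):
        """JIF for year n: citations (from years n+1 and n+2) to the journal's
        year-n items, divided by the number of citable items published in year n."""
        return citations_in_next_two_years / citable_items_published

    # A journal with 40 citable items in year n that attract 100 citations in
    # years n+1 and n+2 has a JIF of 2.5.
    print(impact_factor(citations_in_next_two_years=100, citable_items_published=40))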


The rationale for the JIF is that journals receiving more citations per article are, on average, better because their contents have been found to be more useful (cf. Merton, 1973; <chapter 7). Although this is clearly an oversimplification, journal Impact Factors are still a useful tool if used properly (Garfield, 1994). The importance of the web and the surface similarity between links and citations has triggered investigations into whether link counts could also be used as indicators of impact for journal web sites, or even for the journals themselves. The drive to assess journal-related links was also fuelled by uncertainty about the future of scholarly publishing, and whether it would follow the freely available e-journal model. As demonstrated below in the digital library sections, even though commercial publishers have not been supplanted by free e-journals, their digital libraries are becoming spaces where link analysis can uncover unprecedented information about scholarly communication.

JOURNAL WEB SITES

There are many different ways in which a journal can have a presence on the web, ranging from a line in a web page to full text online (Vaughan & Thelwall, 2003). The following typology is adapted from Kling and Callahan (2004).

• Pure e-journals are published only in electronic form.
• Pure paper journals are only published in print form and are not distributed electronically.
• E/paper journals are distributed in both paper and electronic form.

Before the web, most journals were pure paper journals, although some pure e-journals also existed. By 2004, it seemed that the major academic publishers all offered the choice of electronic or print versions of their journals, typically holding electronic versions in access-controlled digital libraries. Pure paper journals continued to exist with smaller publishers and in some less technologically advanced countries. Pure e-journals appear to have experienced continued growth and at least partial acceptance (Kling & Callahan, 2004; Sweeney, 2000), perhaps helped by the Institute for Scientific Information being prepared to index them. Nevertheless, they do not seem to be threatening the commercial publishers in most subject areas. These developments are in line with the predictions of Oppenheim, Greenhalgh and Rowland (2000).

From a link analysis perspective, it is important to distinguish between the different types of web site associated with journals, in terms of both publicly accessible content and password-protected content. Table 11.1 summarizes the types that seem to be the most common, at least for ISI-indexed journals (Vaughan & Thelwall, 2003). The type of URL is important because publishers' digital libraries tend to be database-driven, which can lead them to have complex URLs that search engines may not index, and to which it may be difficult to link. Direct links to journals are actually impossible in some digital libraries (Vaughan & Thelwall, 2003). Pure e-journals often have their own web site with a very simple URL.


Table 11.1. Different common types of journal web presences.

                                                             Journal in publisher's digital library           Pure paper journal    Pure e-journal
Web mention - journal title only                             Free                                              Free, if available    Free
Basic web site (e.g. author instructions, aims and scope)    Free                                              Free, if available    Free
Contents lists of article titles                             Free                                              Free, if available    Free
Article abstracts                                            Free                                              Free, if available    Free
Article full text                                            Password protected                                Not available         Free
URL                                                          May be simple or complex (e.g. database query)    Not available         Simple

JOURNAL WEB SITE INLINKS: ISSUES

Two early comparisons between citations and inlinks gave negative results and pointed to fundamental differences between citations and inlinks (Harter & Ford, 2000; Smith, 1999). As a result, citation-inlink comparisons subsequently fell out of favor. In the first few years of the web there were not many journal web sites, and so any reasonable-sized sample had to include multiple subjects and different types of web site. In retrospect, then, the negative results were partly due to the heterogeneous nature of the web sites and journals covered. A second issue that discouraged further inlink-citation research was purely technical. Several investigations identified serious shortcomings in search engine results, undermining their value for link data collection (e.g., Snyder & Rosenbaum, 1999). This resulted in the development of personal web crawlers as an alternative data collection method for the different problem area of university link research. Journal web site inlinks are presumably very international in scope, however, and so it would be impractical to use a crawler to discover them, leaving no effective strategy to find journal inlinks. This second issue was solved by the introduction of a new generation of stable search engine algorithms. Although not formally announced, important improvements in stability and internal consistency did occur, particularly in AltaVista (Thelwall, 2001). The first issue, the lack of positive results, was also subsequently resolved by Vaughan and Hysen (2002) finding statistically significant correlations between inlinks and citation-based Impact Factors, an important breakthrough.

Despite the significant results of Vaughan and Hysen (2002) and others, important issues still remain. Smith (1999) found that a fundamental difference between citations and links was that links tended to be to the whole journal, rather than to an individual article. Inlink counts are therefore a potential indicator of the online recognition of the journal itself, whereas the JIF is the average impact of its articles.


It seems possible that the two are closely related, as Vaughan and Hysen (2002) suggest. Harter and Ford (2000) echoed Smith's concern over the nature of links. They also remarked upon many causes of unreliability (as had Smith). These included: changing URLs, search engine instability, mirror sites, unknown primary or alternative journal web site locations, structural differences in site organization, and errors in creating link URLs.

One of the key unresolved issues concerning counting links to journal web sites is whether to normalize or average the inlink count figures. Since dividing citations by citable items normalizes Journal Impact Factors, should the number of linkable items, e.g. web pages, also divide link counts? This is a difficult question because so many links to journal web sites target the home page (i.e. the site as a whole). A logical compromise might perhaps be to divide links by site page counts only for links that do not target the home page, counting the others at full value. This would partially compensate for site size. A second issue is age: since publication date is important in JIF calculations, should links only be counted from pages created within a specified time frame? Again this is problematic because old links to a journal's home page may still be actively used, with no specific need to create a new link when more articles are published. These questions may be answered by a future study, perhaps through careful comparisons of link statistics, but they further illustrate the pitfalls of making direct comparisons between citations and links.
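The compromise suggested above can be stated concisely in code. The minimal sketch below is only an illustration of that suggestion, with hypothetical figures; it is not a method used in any of the studies cited.

    # Sketch of the compromise described above: links to the journal home page
    # count in full, while links to other pages are divided by the site's page
    # count. All figures are hypothetical.
    def normalized_inlinks(home_page_inlinks, other_inlinks, site_page_count):
        return home_page_inlinks + other_inlinks / site_page_count

    # A 200-page journal site with 50 home page inlinks and 120 inlinks to
    # deeper pages scores 50 + 120/200 = 50.6 under this scheme.
    print(normalized_inlinks(home_page_inlinks=50, other_inlinks=120,
                             site_page_count=200))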

JOURNAL WEB SITE INLINKS: CASE STUDY

This section reports on a single study that gives a good overview of the phenomenon of journal web site inlinks. The investigation is a comparison of journal web site inlinks with (citation-based) journal Impact Factors for 88 law journals and 53 library and information science (LIS) journals, selected from the Institute for Scientific Information's subject categories (Vaughan & Thelwall, 2003). Journals were excluded that either did not have a web site or had a web site in a form such that search engines could not count links. Statistically significant correlations were found between journal Impact Factors and journal web site inlinks in both LIS and law, confirming the earlier study for LIS (Vaughan & Hysen, 2002), which was the first investigation to find a significant JIF-inlink count correlation. This is evidence of a relationship between the average citation impact of journals and inlinks to their web sites, but it does not imply that one causes the other. The study investigated other factors that may contribute to linking in order to get a clearer picture of whether inlinks could be used as measures of impact.

• Web site content The journals in the two data sets were homogeneous in the sense of broad subject content but heterogeneous in terms of web site content. Some journal web sites consisted of a single page whereas others were free e-journals. A statistical test revealed that web sites containing more information tended to attract more inlinks. This is not surprising, but is a reminder that links are often created for a reason, and without significant content in a journal's web site there is less reason to link to it.

• Web site age Growth models of the web point to age as an important factor in link counts; whilst it is possible for new sites to attract many inlinks, it is more likely for older sites to have a high inlink count.


Is this also true, and a significant factor, for journal web sites? Statistical tests showed that older journal web sites did tend to attract more inlinks. Web site age was estimated by the earliest date when the web site was recorded in the Internet Archive, so 'oldest known age' is really a more accurate description than 'age'.

• Journal discipline Disciplinary differences were found in the results. The median number of inlinks per unit of journal Impact Factor was 19 for law but 100 for LIS, pointing to a much higher ratio of online interest to average citation impact for LIS journals. This could be because LIS has a higher online presence (law has a low web presence, e.g., Thelwall, Vaughan, Cothey et al., 2003) or because law has a higher citation use.

It is clear from this study that although average citation impact transfers or relates to link creation in a significant way, this is mediated by site age, site content and discipline. This is summarized in Figure 11.1. The lesson for journal webmasters is clear: if they want to attract visitors through links, then they should put as much useful content as possible in their web site and keep its URL as stable as possible (Vaughan & Thelwall, 2003).

Figure 11.1. The journal web site linking model.

The remainder of this chapter deals with issues relating to journals and links, but not directly relating to journal web site inlinks.

TYPES OF LINKS IN JOURNAL ARTICLES

Links in journal articles have been a topic of research for reasons not directly connected to link analysis; these are briefly reviewed in this section. The types of links and URLs in journal articles reveal differences between traditional citations and links in the environment in which they are most similar, and even overlap. Kim (2000) interviewed a number of authors of e-journal articles in order to discover their motivations for creating a hyperlink. A common reason was the provision of an easy access mechanism to allow readers to locate the cited information. Another motivation that is unlike traditional citation practice was simply that creating a link was possible. Kim concluded that "Hyperlink counts as a measure of quality are suspect due to the complex nature of motivations for their use."

Herring (2002) also investigated links in e-journal articles, but from the perspective of investigating the types of resources to which they link. Her two key findings were: that the documents to which they linked exhibited a much greater range of different types than the documents cited in print resources; and that there was a greater range of interdisciplinarity in the links found. These may have been caused by the relative ease with which authors can conduct interdisciplinary searches on the web using commercial search engines, and these same search engines will retrieve non-scholarly documents in parallel with scholarly ones.

The two papers briefly reviewed above show that differences between URLs and citations occur even when links are used as citations in journal articles.

DIGITAL LIBRARY LINKS

Digital libraries are now commonly web-based and allow articles to contain clickable hyperlinks. This provides an entirely new arena from which to study links and scholarly communication. Wouters and de Vries (2005) have contributed a unique study into linking from articles within digital libraries. They analyzed links in both Portable Document Format (PDF) and HTML documents, taking some from the Science Direct digital library and others from the web. Linking practices varied greatly by discipline and journal type. For example, pure e-journals were more likely to link to pages outside of the digital library, and sociology articles contained relatively few links per citation compared to the other subjects analyzed. The format in which the articles were stored was a surprisingly important factor: PDF documents tended to contain few hyperlinks or none at all. From a validity perspective, the task of interpreting the results was complicated by the fact that the digital library software automatically inserted many of the links. This type of research is technologically challenging because of the need to process the complex PDF document format and the necessity for full access to the digital library in order to gather the necessary data. Digital library link analysis may be the future for traditional citation analysis, however, because of the wide uptake of the Digital Object Identifier (DOI) standard (www.doi.org), which facilitates robust citation linking, even between different digital libraries.

COMBINED LINK AND LOG FILE ANALYSIS

Web server log files are collections of information about the pages and other files that users have requested from web servers. They are a very good source of information about how a site has been used. Web server log files are an ideal source of additional information for link analysis because they indicate how often a link has actually been used. They have not been used often in conjunction with links for an information science style of link analysis for the simple reason that they are normally kept private to each web server's webmaster. A link analysis study typically spans many web servers and it would be impractical to gain access to all of the necessary log files for a comprehensive analysis. Nevertheless, there is some particularly interesting research analyzing log files in digital libraries. Within the boundaries of one particular web site, such as a digital library, full analyses can be conducted from a single log file.

Harnad and Carr (2000) have investigated linking within the Los Alamos Eprint Archive, an eprint archive for physics. The Open Citation Linking Project automatically added links to articles in the archive, connecting them to cited papers. This anticipated the adoption of similar technology years later by commercial publishers.


One very useful facility that the added links provide is the ability to use web server log files to track visitors as they navigate to cited references. This is something that has not been previously possible with print literature and has a promising future, especially if digital library owners allow researchers access to this kind of data. An interesting, but not strictly relevant, byproduct of the combination of linking software and web log analysis was the discovery that papers tended to be read by users when they had been initially deposited in the archive, and also when they had been first cited by another archive paper.

Brody, Carr and Harnad (2002) continued the analysis of eprint archives through automatically added hyperlinks and web server log files, concentrating on the high-energy physics area. An astonishing finding was that the average time between a paper being deposited in the archive and it receiving its first citation 'decreased over the period of the archive from about a year to about a month', suggesting a speeding up of the process of scientific communication, presumably aided by the instant online availability of eprints.

RELATED RESEARCH TOPICS

Perhaps the most frequently discussed issue concerning links and scholarly communication is the possibility of broken links. If an article cites URLs then, over time, these URLs may disappear and this may undermine the argument in the article itself. In print journals this has not been a worry because of ongoing initiatives to archive journals. On the web, there is also a large-scale archiving initiative, the Internet Archive (www.archive.org), but this does not claim complete coverage of the web. Casserly and Bird (2003) have looked at the issue of the availability of URLs that have been cited in scholarly papers, using five hundred link-citations from library and information science journals. They found that just under 10% could not be found, even after searching the Internet Archive. Even this relatively small percentage is still a cause for concern, since the loss of one reference could be critical. As a result, Casserly and Bird provide a series of guidelines to minimize the impact of lost URLs.

There are other papers about electronic citation that are worth reading for additional perspectives and related issues concerning traditional citation on the web. Some key points are picked out below.

• Vaughan and Shaw (2003) analyzed text-based citations of journal articles in web pages, finding that for the majority of journals studied, web citations of individual articles correlated significantly with ISI citations. Citations to articles in journals also correlated significantly with ISI Impact Factors in most cases.

• Lawrence (2001) asserted that making articles available online increases their chance of being cited.

• Goodrum, McCain, Lawrence and Giles (2001) compared citations in the computer-science-dominated CiteSeer digital library with ISI citations, finding that conference papers received a significantly higher share of online citations than ISI citations, indicating a fundamentally different character for the two data sources.

• Zhao and Logan (2002) analyzed articles relating to the XML research area using CiteSeer. They found that online and offline citation analyses both had strengths and weaknesses, and that they could best be used in parallel.


This is probably most true for computer science, where conference proceedings have a higher status than in most other disciplines, and are frequently posted on the web.

• Some research has operated on a larger scale, conducting comparative analyses of links to library web sites, giving some insights into how libraries are perceived and used online (Tang & Thelwall, 2004/5; Vreeland, 2000).

• Finally, in an interesting investigation unrelated to links, Cronin, Snyder, Rosenbaum, Martinson, and Callahan (1998) have investigated the context in which academics are mentioned in web pages.

SUMMARY

Comparisons of journal web site inlinks and the citation-based journal Impact Factors have not yet reached a definite conclusion. It is clear that the two phenomena correlate significantly, but that journal inlinks are influenced by web site age and content as well as journal discipline. It is not clear yet whether they are essentially measuring the same thing, i.e., whether inlink counts, or a modified version of them, could be described as impact measures, or whether an alternative description, such as online visibility or utility, would be appropriate. The results so far are consistent with inlink counts being a hybrid measure of traditional impact and online availability and utility.

The digital library research reported here really just gives a taster of the future potential for scientometric analysis, if journal publishers give access to their web server log files. If they do, then this could give access to a much more detailed source of information about formal scholarly communication and, in particular, about the way in which citations are used.

FURTHER READING

The early papers of Smith (1999) and Harter and Ford (2000) are worth reading for both historical perspective and theoretical discussion. The review chapter of Kling and Callahan (2004) is essential reading for a deeper analysis of electronic journals. See also Rudner, Gellmann and Miller-Whitehead (2002) for an approach that combines web log files with other sources of information about e-journal use. This gives good ideas for future link analysis research, although links were not used in the article.

REFERENCES

Brody, T., Carr, L., & Harnad, S. (2002). Evidence of hypertext in the scholarly archive. Proceedings of ACM Hypertext 2002, 74-75.

Casserly, M.F. & Bird, J.E. (2003). Web citation availability: Analysis and implications for scholarship. College and Research Libraries, 64(4), 300-317.

Cronin, B., Snyder, H.W., Rosenbaum, H., Martinson, A., & Callahan, E. (1998). Invoked on the Web. Journal of the American Society for Information Science, 49(14), 1319-1328.

Garfield, E. (1994). The impact factor. Current Contents, June 20. Available: http://www.isinet.com/isi/hot/essays/journalcitationreports/7.html


Goodrum, A.A., McCain, K.W., Lawrence, S. & Giles, C.L. (2001). Scholarly publishing in the Internet age: A citation analysis of computer science literature. Information Processing & Management, 37(5), 661-676.

Harnad, S. & Carr, L. (2000). Integrating, navigating, and analysing open eprint archives through open citation linking (the OpCit project). Current Science, 79(5), 629-638.

Harter, S. & Ford, C. (2000). Web-based analysis of e-journal impact: Approaches, problems, and issues. Journal of the American Society for Information Science, 51(13), 1159-1176.

Herring, S.D. (2002). Use of electronic resources in scholarly electronic journals: A citation analysis. College & Research Libraries, 63(4), 334-340.

Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.

Kim, H.J. (2000). Motivations for hyperlinking in scholarly electronic articles: A qualitative study. Journal of the American Society for Information Science, 51(10), 887-899.

Kling, R. & Callahan, E. (2004). Electronic journals, the Internet, and scholarly communication. Annual Review of Information Science and Technology, 38, 127-177.

Lawrence, S. (2001). Free online availability substantially increases a paper's impact. Nature, 411(6837), 521.

Merton, R. (1973). The sociology of science: Theoretical and empirical investigations. Chicago: University of Chicago Press.

Oppenheim, C., Greenhalgh, C., & Rowland, F. (2000). The future of scholarly journal publishing. Journal of Documentation, 56(4), 361-398.

Rodríguez i Gairín, J.M. (1997). Valorando el impacto de la información en Internet: AltaVista, el "Citation Index" de la Red. Revista Española de Documentación Científica, 20, 175-181.

Rudner, L.M., Gellmann, J.S., & Miller-Whitehead, M. (2002). Who is reading on-line education journals? Why? And what are they reading? D-Lib Magazine, 9(12). Accessed June 7, 2004. Available: http://www.dlib.org/dlib/december02/rudner/12rudner.html

Smith, A.G. (1999). A tale of two web spaces: Comparing sites using web impact factors. Journal of Documentation, 55(5), 577-592.

Snyder, H. & Rosenbaum, H. (1999). Can search engines be used as tools for web-link analysis? A critical view. Journal of Documentation, 55(4), 375-384.

Sweeney, A.E. (2000). Tenure and promotion: Should you publish in electronic journals? The Journal of Electronic Publishing, 6. Accessed June 9, 2004. Available: http://www.press.umich.edu/jep/06-02/sweeney.html

Tang, R. & Thelwall, M. (2004/5, to appear). A hyperlink analysis of US public and academic libraries' Web sites. Library Quarterly.

Thelwall, M., Vaughan, L., Cothey, V., Li, X. & Smith, A.G. (2003). Which academic subjects have most online impact? A pilot study and a new classification process. Online Information Review, 27(5), 333-343.

Thelwall, M. (2001). The responsiveness of search engine indexes. Cybermetrics, 5(1). Available: http://www.cindoc.csic.es/cybermetrics/articles/v5i1p1.html

Vaughan, L. & Hysen, K. (2002). Relationship between links to journal web sites and Impact Factors. Aslib Proceedings: New Information Perspectives, 54(6), 356-361.

Vaughan, L. & Shaw, D. (2003). Bibliographic and web citations: What's the difference? Journal of the American Society for Information Science and Technology, 54(4), 1313-1324.


Vaughan, L. & Thelwall, M. (2003). Scholarly use of the web: What are the key inducers of links to journal web sites? Journal of the American Society for Information Science and Technology, 54(1), 29-38.

Vreeland, R.C. (2000). Law libraries in hyperspace: A citation analysis of World Wide Web sites. Law Library Journal, 92(1), 9-25.

Wouters, P. & de Vries, R. (2005, to appear). Formally citing the web. Journal of the American Society for Information Science and Technology.

Zhao, D., & Logan, E. (2002). Citation analysis using scientific publications on the Web as data source: A case study in the XML research area. Scientometrics, 54(3), 449-472.


IV APPLICATIONS

12

SEARCH ENGINES AND WEB DESIGN

OBJECTIVES

• To introduce design considerations for effective site coverage by search engines.
• To describe the link-based search engine algorithms HITS and PageRank, and to explore implications of these algorithms for web site design.

INTRODUCTION

Web site designers and information professionals can benefit from knowledge of how search engines operate. They need to ensure that their web sites are found and indexed by the major search engines in order to attract new visitors. Web site designers also need to make sure that their sites are ranked reasonably in the results returned by search engines for relevant queries. Both of these needs relate to links, and designers can therefore benefit from knowing how search engines use links. The first part of this chapter deals with the basic issues of finding, crawling and indexing web sites, and the second introduces two important search engine ranking algorithms.

LINK STRUCTURES AND CRAWLER COVERAGE

Web site designers need to take into account the way that commercial search engine web crawlers operate if they wish to attract new visitors. Little is known about the specific details of commercial search engine crawlers, but knowledge about how web crawlers in general operate (<chapter 2) gives some guidelines.


Crawlers follow links extracted from the text of web pages. It follows that any web site should contain enough links for a crawler to be able to find all the site's pages. More specifically, since a crawler will probably start at the home page, the whole site should be able to be crawled by following links from the home page. Additionally, the links must be in the HTML of the web pages, in standard HTML (i.e. using the anchor tag), because the crawler may not find links in other formats, including those in Java or JavaScript programs.
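The following minimal sketch, in Python, illustrates this style of crawling: starting from a home page, it follows only standard HTML anchor tags within the same site, so any page reachable only through JavaScript, Java or other non-HTML links would never be found. The starting URL is hypothetical, and a real crawler would also respect robots.txt and pause between requests.

    # Minimal sketch of a crawler that discovers pages only by following
    # standard HTML anchor tags from the home page, staying within one site.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class AnchorExtractor(HTMLParser):
        """Collects href values from <a> tags, the only links this crawler sees."""
        def __init__(self):
            super().__init__()
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.hrefs.append(value)

    def crawl_site(home_page, max_pages=50):
        """Breadth-first crawl: pages not reachable via anchor tags are never found."""
        site = urlparse(home_page).netloc
        to_visit, found = [home_page], set()
        while to_visit and len(found) < max_pages:
            url = to_visit.pop(0)
            if url in found:
                continue
            found.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except OSError:
                continue
            parser = AnchorExtractor()
            parser.feed(html)
            for href in parser.hrefs:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == site:   # stay within the site
                    to_visit.append(absolute)
        return found

    # found_pages = crawl_site("http://www.example.ac.uk/")   # hypothetical site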

Web site designers should ensure that search engine crawlers can find their sites, since many sites are never found (Thelwall, 2000). Some search engines have an URL submission facility to allow web site owners to add URLs to the crawl list. This facility should be used for all the major search engines. Some commercial search engines do not have an URL submission facility, but will only visit a site if they find a link to it on another site that they already index. As a result, it is also important to make sure that all new sites are linked to by at least one site that is already crawled by the major search engines.

TEXT IN WEB SITES AND THE VECTOR SPACE MODEL

The text in a web page is very important for search engines, more important than for human users. Web page text is not related to the link theme of this book, but is important to give a rounded impression of search engine-aware web site design. At this stage it is useful to make a distinction between crawling and indexing. Crawling a site means visiting its pages by following links and (normally) saving a copy of each page. Indexing is a stage that follows crawling and means incorporating the text of the web pages into a special 'inverted index' database format that allows fast searching for pages matching a user's query. Standard search engine indexes normally only contain web page text, ignoring graphics and other multimedia. Thus, when discussing search engines, it is important to differentiate between text and graphics.

A person visiting a web site may be able to tell what it is about from the images rather than the text. To give an extreme example, just the presence of a well-known company logo in a prominent position might be enough to reveal the page owner. In contrast, for search engines all images are ignored, at least in terms of extracting meaning from them. If a web page is full of pictures of shoes but does not contain the word "shoe" in its text somewhere, then a search engine cannot know that the page is about shoes. The lesson is that web designers need to ensure that their pages contain words that describe their site's contents.

There are some clever ways that search engines may use to guess that the page is about shoes. If a shoe pictures page is linked to by other pages that mention shoes near the link, then this may be used as evidence of a shoe connection, an approach pioneered by Google (Brin & Page, 1998; cf. Bharat & Mihaila, 2001; Rafiei & Mendelzon, 2000). Alternatively, there are some sophisticated techniques such as latent semantic analysis (Deerwester, Dumais, Furnas et al., 1990) for guessing the topic of a page from synonyms or other words associated with the topic. For example, if the shoes page contained the word 'boot' or a few other shoe-related words such as 'laces' and 'soles', then this may be enough to identify it as shoe-related. Nevertheless, a web designer cannot rely upon all search engines using advanced techniques and so it is common sense to ensure that key pages have plenty of text relating to the topics for which they would like to attract visitors. It is important that this text is in the HTML of a site's web pages and not only in images, or other media such as Java programs or multimedia presentations.


Some information about how search engines process web page text is useful to understand how they find relevant pages. All commercial search engines seem to use a technique known as the vector space model (Salton & McGill, 1983; Baeza-Yates & Ribeiro-Neto, 1999), or a variation of it, in conjunction with other approaches. The model converts all web pages into "bags of words" in no particular order. Each page is represented primarily by a list of how often each of its words occurs. These word counts are later converted to weights using a mathematical formula that gives higher weights to words that occur often in a document compared to other words in the same document. It also gives higher weights to rare words, ones that do not occur very often in other web pages. See the appendix to this chapter for more details of the vector space model.

The vector space model is a useful primary method for search engines because it allows efficient searching of large numbers of pages to identify the ones that are most relevant to specific words. The fast searching capability comes from the construction of the inverted index alluded to above, which is a list of relevant pages for each word. Terms in a query can be looked up rapidly in an alphabetical list using a binary search technique. Each term record points to a reference to each page where the term is found, together with the associated weights of the terms generated for that page. A variant of the cosine similarity measure (see appendix) is then commonly used to help rank the pages relative to a similarly weighted query term vector, often with some similarity threshold and in conjunction with other techniques, such as PageRank or HITS (see below).

The lesson for web site designers is not only that pages should contain words that potential new visitors may use to search for the site, but also that these words may need to occur several times, particularly if they are common words, to convince the search engine algorithm that the page is strongly related to the word. Meaningless word repetition should not be attempted, however, as this is likely to get the page flagged as spam and excluded from search engine indexes.
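The sketch below illustrates the general idea with a toy collection of three 'pages': word counts are converted to weights that reward within-page frequency and cross-page rarity, and a cosine similarity against the weighted query ranks the pages. The exact weighting formula is an assumption for illustration only, since commercial search engines do not publish theirs, and the page contents are invented.

    # Toy illustration of the vector space model described above.
    import math
    from collections import Counter

    pages = {
        "shoe_shop": "shoes boots laces soles shoes",
        "zoology_dept": "zoology animals zoology biology",
        "shoe_prices": "shoes shop prices",
    }

    def weight_vector(text, doc_freq, n_docs):
        counts = Counter(text.split())
        # Higher weight for words frequent in this page but rare elsewhere.
        return {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}

    def cosine(v1, v2):
        dot = sum(v1.get(w, 0.0) * v2.get(w, 0.0) for w in v1)
        norm = math.sqrt(sum(x * x for x in v1.values())) * \
               math.sqrt(sum(x * x for x in v2.values()))
        return dot / norm if norm else 0.0

    doc_freq = Counter(w for text in pages.values() for w in set(text.split()))
    vectors = {name: weight_vector(text, doc_freq, len(pages))
               for name, text in pages.items()}
    query = weight_vector("shoes", doc_freq, len(pages))

    for name, score in sorted(((n, cosine(query, v)) for n, v in vectors.items()),
                              key=lambda pair: pair[1], reverse=True):
        print(name, round(score, 3))
    # The page that mentions "shoes" most often is ranked first.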

THE PAGERANK ALGORITHM

Link-based ranking algorithms now seem to be widely used in commercial search engines. Web site designers can benefit from an understanding of how these algorithms work, in order to design sites that are ranked effectively by search engines. The success of Google means that its PageRank algorithm is of particular interest, and it is one of the few published ranking algorithms. Although highly mathematical, PageRank admits a simple underlying explanation that allows an analysis of its impact upon Web spaces. It is likely that some other search engines use algorithms that give very similar results to PageRank, although they do not publish any information about them. HITS is a different link-based ranking algorithm that gives an informative contrast to PageRank. The following quote encapsulates the value of links to search engines.

"By analyzing how pages link to each other, a search engine can both determine whata page is about and whether that page is deemed to be "important" and thus deservingof a ranking boost." (Sullivan, 2001)

A search engine that ignores links in its ranking process may use a formula based upon discovering the frequency of the keywords of a user-entered search in each potential matching document. For example, a search for "zoology" would be likely to return pages with this word in the document title, main headings and body, and perhaps also in its URL. It would not be possible for the program to guess which was the most authoritative zoology page, only the one that was in some sense richest in zoology-related text.


PageRank, on the other hand, would guess authority based upon link structure, perhaps ranking highest the page that was the most frequent target of links. This would make it far more likely that a genuinely respected page would be returned rather than, say, a zoology course timetable.

The core of Google's PageRank™ algorithm has been published by its designers and founders Brin and Page (1998) and subsequently described in more detail by Page, Brin, Motwani and Winograd (1999). The same algorithm was still in use in 2004, although it is part of a much larger set of "more than 100 factors" used to decide which pages best match a user's queries, and how to rank them (Google, 2004b). Google's official statement is, "while we have dozens of engineers working to improve every aspect of Google on a daily basis, PageRank continues to provide the basis for all of our web search tools" (Google, 2004a). The following two basic ideas underlie PageRank.

• Inlinks are good indicators of the importance of the target page.
• Inlinks from more important pages are better indicators of importance than inlinks from less important pages.

PageRank is described in the rest of this section and a worked example is given in the next section. The voting metaphor used on the Google web site (Google, 2004) and elsewhere (Lifantsev, 2000; see also chapter 3) is used here instead of the equivalent original (Brin & Page, 1998) 'random surfer' explanation.

A simple link-based voting system would be to give each web page a vote, allowing it to split its vote evenly (in fractions) amongst all the pages to which it links. Counting votes for pages would form a ranking system, with the pages having many inlinks tending to get the most votes. This voting system does not go far enough, however. Popular link list pages, for example, will gather many votes if they are well linked to, but will only have one vote to share between their link targets, which presumably contain the valuable content. It makes sense to repeat the process again, allowing each page to pass on votes acquired in the previous round to its target sites. Unfortunately, however, repeated voting does not work because votes get trapped in circular voting loops, or stuck at pages without outlinks (Brin & Page, 1998).

The solution of Brin and Page is to recycle a percentage of the votes at each stage instead of sending them to link targets. They suggest the figure of 15%, so that at any voting stage 85% of each page's votes are allocated to its link targets and 15% are distributed evenly to all URLs in the system. A mathematical algorithm can efficiently implement this voting system, producing PageRanks by repeating the voting process until the PageRank votes for all pages stabilize, i.e. do not change much with each new round of voting.
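The voting process can be simulated in a few lines of code. The sketch below follows the simplified description above (one redistributed vote per page per round, with 85% of each page's vote passed evenly to its link targets) rather than the full published algorithm, and the four-page link structure is hypothetical rather than the one drawn in Figure 12.1.

    # Simulation of the simplified voting process described above: every page
    # receives one redistributed vote per round and passes on 85% of its
    # current vote, split evenly among its link targets.
    def simplified_pagerank(links, rounds=20, share=0.85):
        votes = {page: 1.0 for page in links}
        for _ in range(rounds):
            new_votes = {page: 1.0 for page in links}   # the redistributed vote
            for source, targets in links.items():
                for target in targets:
                    new_votes[target] += share * votes[source] / len(targets)
            votes = new_votes
        return votes

    # Hypothetical structure: page A links to a middle page M, which links to
    # two end pages C and D.
    links = {"A": ["M"], "M": ["C", "D"], "C": [], "D": []}
    print(simplified_pagerank(links))
    # The votes quickly stabilize, with M and the two end pages ahead of A,
    # which has no inlinks.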

Figure 12.1 illustrates a very simple system of four linked pages, with arrows representing links between pages. Assume that these four pages are part of the whole web, but not connected to any other pages. Assume also that 1 vote is given to each page at the start of each PageRank round. In reality, these votes are always fractions derived from the total redistributed unused votes of the 'whole' web. The number on each arrow is the vote that each page allocates through its links, and the number in each circle is the vote possessed by the page before allocating the votes indicated on the arrows. Note that the vote that a page has in one round is not added to its vote from the previous round; it is the sum of only the incoming votes and its share of the redistributed votes.


Figure 12.2. Round 2 of PageRank voting.

In Figure 12.1a, the vote from the middle page is shared between the two pages to which it links, hence the multiplication by 0.5 in the calculation. The numbers in the circles in Figure 12.1b show the results of the round 1 votes.


Each page has the single vote allocated to all pages at the start of each voting round, and the three inlinked pages have additional votes from their inlinks.

Figure 12.2a also shows the votes to be allocated in round 2. The only difference in votes from round 1 is that the middle page has a higher vote to allocate (1.85 votes instead of 1 vote). In the Figure 12.2b circles, the higher vote from the middle page has resulted in a higher vote for the end pages. All subsequent voting rounds will leave the system unchanged, so this is the final set of votes for the pages.

There are two potential modifications to PageRank that could make a significant difference, and may actually be implemented by Google. The first is to operate on the basis of Web sites rather than pages, as suggested by Lifantsev (2000). The second is to implement an early suggestion of Page, Brin, Motwani and Winograd (1999, pp. 11-12) to automatically give higher votes to web site home pages. It seems possible that both of these are used by Google, perhaps in conjunction with its main page-based standard algorithm, or perhaps replacing it. No firm evidence is known for this, however.

CASE STUDY: PAGERANK CALCULATIONS FOR A GATEWAY SITE

This gateway site case study serves the dual purpose of illustrating PageRank and providing a practical example of the implications of web site design decisions on PageRank calculations. For the purposes of this example, gateway sites are just lists of links to useful resources for a given subject or topic. It seems that the early days of the web saw many individuals creating their own link lists, but now there seems to be an increasing trend away from this, at least in education, and towards reliance upon official (or widely recognized unofficial) gateway sites. Figure 12.3 illustrates a situation without a gateway site where there are 10 link lists, all linking to 100 pages with some kind of useful content. For the sake of simplicity, assume that no other web page links to any of the pages in this set.

At the start of the PageRank voting, assume again that each page is given one vote. In voting round one, each page votes and then is given an additional vote. It is again assumed that one vote is redistributed to each page in each voting round, ignoring the fact that the total number of votes to be redistributed varies in each round. Pages that link to other pages split 85% of their vote evenly amongst all target pages. Pages without any link targets forfeit the ability to vote for other pages and their 85% is lost along with the 15% that all pages lose anyway. In subsequent rounds the same voting patterns recur, and so the votes after round one stay the same. In more complex systems of links, additional voting rounds would be needed to determine the eventual distribution of votes.

The only voting in this system is from the 10 link list source pages to the 100 target content pages. Each source page starts with one vote, and is allowed to vote 85%, so has 0.85 to vote with. This fractional vote has to be shared with all 100 link targets and so each one gets a hundredth of 0.85, or 0.0085. Now each target page receives a vote of 0.0085 from all of the 10 source pages, a total vote of 10 x 0.0085 = 0.085. The net result is that the target pages end up with a vote of 1 + 0.085 = 1.085, after adding the one vote that is given to every page. This is illustrated in Table 12.1. Note that the votes at pages without links are actually used to increase the value of the vote redistributed to all pages at the start of each round, but for the example shown, this increase is insignificant.


Figure 12.3. A link structure without a gateway site.

Table 12.1. PageRank calculations.

Pages

Source pages 1... 10Target pages 1...100

Redistributedvote

11

Vote received bypage

0

10x°-85 =0.085100

Total vote atthe end ofround one

11.085

Figure 12.4 shows the same set of source and target pages after the inclusion of a gateway site. This operates under the assumption that all link list source pages switch from linking to all the target content pages and link just to the gateway page instead. PageRank should compensate for the reduced inlinks to the target pages with an increased weight for the inlinks from the gateway page.

Figure 12.4. A link structure with a gateway page replacing direct links.

At the start of PageRank, each page begins with one vote. In voting round one, the source pages give 85% of their starting vote to the gateway page, which splits 85% of its original vote amongst all the target pages. In this round, the target pages have a low vote because the fact that the gateway page has many inlinks has not fed through the system yet. In round two the same process recurs with the new votes, but in this case the increased gateway site vote feeds through to the target pages. In subsequent rounds the same voting patterns recur, and so the round two votes do not change again.


Table 12.2. PageRank calculations for the gateway system, round one.

Pages                   Redistributed vote    Vote received by page       Total vote at the end of round one
Source pages 1...10     1                     0                           1
Gateway                 1                     10 x 0.85 x 1 = 8.5         9.5
Target pages 1...100    1                     0.85 x 1/100 = 0.0085       1.0085

As can be seen from Table 12.2 and Table 12.3, the main effect of introducing a gateway site under these idealized conditions is that the gateway site has a much higher PageRank than any individual content page. A secondary effect is that the value of the PageRank of the content pages drops slightly, by 5% in the above example. If only a proportion of pages switch from linking to all the content pages to linking to a gateway site, then the PageRank changes in ranking will not be as extreme as illustrated here, but will still be in the same direction. Some link lists will probably disappear rather than remain with a single link to the gateway site, if their reason for existing has been taken away. This would reduce the PageRank of both the gateway site and the target content pages.

Table 12.3. PageRank calculations for the gateway system, round two.

Pages                   Redistributed vote   Vote received by page        Total vote at the end of round two
Source pages 1...10     1                    0                            1
Gateway                 1                    10 x 0.85 = 8.5              9.5
Target pages 1...100    1                    0.85 x 9.5/100 = 0.08075     1.08075
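
These round-by-round figures can be reproduced with the same kind of sketch used for Figure 12.3. The code below is a minimal, self-contained illustration under the chapter's simplified voting rules; the page names are again invented.

    # Simplified voting applied to Figure 12.4: the ten link list pages now link
    # only to the gateway page, which links to all 100 content pages.
    def vote_round(links, votes):
        new_votes = {page: 1.0 for page in votes}
        for source, targets in links.items():
            if targets:
                share = 0.85 * votes[source] / len(targets)
                for target in targets:
                    new_votes[target] += share
        return new_votes

    sources = ["source%d" % i for i in range(10)]
    targets = ["target%d" % i for i in range(100)]
    links = {s: ["gateway"] for s in sources}
    links["gateway"] = targets
    links.update({t: [] for t in targets})
    votes = {p: 1.0 for p in links}
    for _ in range(2):  # two voting rounds, as in Tables 12.2 and 12.3
        votes = vote_round(links, votes)
    print(votes["gateway"], votes["target0"])  # 9.5 and 1.08075, matching Table 12.3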

In real world examples, each source page would probably link to only a subset of the gateway site targets, presumably with differing overlaps, giving the highest PageRank to the most useful pages. The gateway site, under real conditions, will tend to counteract this by allocating the same vote to each target site irrespective of quality or usefulness. This will reduce the range of PageRanks between the various content pages. This is an undesired effect, because differentiating between pages based upon their usefulness or quality is the objective of PageRank. This switches the responsibility for identifying and highlighting the best web sites from the 'democracy' of individually-created web links to the 'dictatorship' of the gateway site owner. This is not necessarily a bad thing, especially if the gateway site owner is a subject specialist or trained information professional, but this shift in power may not be an obvious consequence of gateway site creation.

The number of gateway site pages that must be traversed to reach the content page links affects their PageRanks. The above calculations have used the assumption that the gateway site is a single page. In reality, most have several pages that must be clicked through from the home page before the external links are met. Each additional page 'loses' 15% of the site inlink votes, but adds one vote (for itself). Unless the gateway site gets a very high number of links (e.g. yahoo.com), the mathematics of this argues for more pages to click through on the site, and fewer links per page, because it will give the targeted content pages a
higher PageRank. This is undesirable because it gives the site user more work to do, a usability problem (Nielsen, 2001). Gateway designers have to balance usability with attractiveness to search engines.
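
A rough numerical check of this trade-off is sketched below, under the chapter's simplified voting model. The figures and the 10-by-10 gateway structure are hypothetical, chosen only to illustrate the argument.

    # Vote eventually received by each content page when the gateway's inlink votes
    # pass through one or more levels of pages, each level keeping 85% of what it
    # receives and adding its own single redistributed vote.
    def content_page_bonus(inlink_vote, fanouts):
        vote = 1 + 0.85 * inlink_vote        # total vote at the gateway page
        for fanout in fanouts[:-1]:
            vote = 1 + 0.85 * vote / fanout  # total vote at each page on the next level
        return 0.85 * vote / fanouts[-1]     # share received by each content page

    print(content_page_bonus(8.5, [100]))      # one page linking to 100 targets: 0.08075
    print(content_page_bonus(8.5, [10, 10]))   # two levels of 10 links each: about 0.154

In this toy model the extra level of pages helps the content pages unless the gateway's inlink vote is very large, because each additional page contributes its own redistributed vote while losing only 15% of what it passes on; with very large inlink votes (the yahoo.com case) the 15% losses dominate, which matches the argument above.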

In summary, the introduction of a gateway site to organize a subject area on the web may well have two long term effects: the gateway site itself should gain a high PageRank, and PageRanks for content pages in the subject area should become more evenly distributed. Individual web page authors should continue to create links to subject gateways that organize useful resources because these are likely to become a first port of call for those searching the web for subject-specific resources. They should also continue to link directly to pages considered to contain high quality content in order to help differentiate them in search engine rankings.

HITS

The HITS (Hyperlink Induced Topic Search) algorithm is another link-based algorithm for search engines, one that is designed to find web pages that are most useful for a user's query (Kleinberg, 1999). HITS or something very similar seems to be the basis for the search engine Teoma's (www.teoma.com) Subject-Specific Popularity™ (Teoma, 2004). HITS, unlike PageRank, is not applied at once to the whole web. It only ranks a subset of pages that are judged to be relevant to the query entered by a user into a search engine, and therefore must be calculated separately for each user query. Another difference is that HITS discriminates between two desirable types of pages. The first are authorities, pages that are linked to by many relevant pages. These are pages with a high relevant inlink count. The second desired type of page is the hub, a page that is the source of many relevant links. Authorities are similar to the pages that are highly ranked by PageRank, but PageRank ignores hubs. The HITS rationale for seeking hubs is that pages that link to many topic-relevant pages may be very useful to search engine users that are seeking information on the topic. In summary, there are two main differences between HITS and PageRank.

• HITS is query-specific, but PageRank is query-independent.
• HITS identifies both good link sources and good link targets whereas PageRank only identifies good link targets.

Figure 12.5. A hub and an authority.

Figure 12.5 illustrates a network of interlinked pages with shaded pages being topic-relevant. Page A is an authority because multiple topic-relevant pages link to it and page H is
a hub because it links to multiple topic-relevant pages. Page 1 is not an authority, even though multiple pages link to it, because only one of them is topic-relevant.

HITS WORKED EXAMPLE

The HITS algorithm is a little more complex than the PageRank algorithm, but its essence is described in simplified form, and a worked example given.

As can be seen from the description of HITS in the Box below, the first stage is to identify a set of potentially topic-relevant pages. Topic-relevance is assessed individually for each user's query submitted to a search engine. Pages are judged to be topic-relevant if they

a) are in a set of the pages that contain text that is most relevant to a user's query, or
b) link to one of the pages found in (a), or are linked from one of the pages found in (a).

There is an important link assumption here that is partially supported by the link-content hypothesis (<chapter 6). This assumption is that a page may be relevant to a topic if it is connected by a link to a topic-relevant page, even if it does not contain text that matches the topic, at least in terms of the user's query text. Judging by the link-content research reviewed in chapter 6, this will be true some of the time, but not always. Of course, pages judged relevant by their text content will also sometimes be inappropriate because topic-relevance can be very difficult to determine in practice, especially as some queries are inherently ambiguous. An example of this is the classic query 'jaguar', which could be entered by someone searching for information about the animal or the car with this name. In summary, the topic-relevant pages are likely to be, in practice, incomplete and partially relevant. Nevertheless, Kleinberg's (1999) demonstrations show that this is not necessarily a fatal problem.

The second part of the algorithm calculates hub and authority values for each page in the topic-relevant set. The algorithm uses voting in a similar way to PageRank, but also uses a kind of reverse voting so that pages vote for pages that link to them. The end result of the HITS algorithm is a hub value and an authority value assigned to each node, rather than the binary distinction between hubs and authorities described above.

The example in figures 12.6 to 12.11 illustrates steps 4 to 7 of the algorithm for a very simple network.


Figure 12.6. Step 4. Initial hub (top) and authority (bottom) values.

Simplified HITS algorithm

Stage I: Find a set of query/topic-related pages.
1. From a search engine user's text-based query, find t pages with text relating most closely to the query, where t is some predefined parameter (Kleinberg's root set).
2. Add all pages linked from or linking to the matching pages (Kleinberg's base set).
3. Remove all links between pages within the same site.

Stage II: Initialize the hub and authority values of each page.
4. Assign each page an authority weight x and a hub weight y, e.g. x = y = 1.

Stage III: Iterate voting procedures.
5. Calculate the authority weight of each page by totaling the hub values of each page from which it is linked.
6. Calculate the hub weight of each page by totaling the authority values of each page to which it links.
7. Normalize hub values by dividing all by the highest hub value. Normalize authority values by dividing all by the highest authority value (Kleinberg uses a more complicated calculation).
8. Repeat steps 5 to 7 for a set number of iterations (Kleinberg suggests 20).

Stage IV: Reporting results.
9. Return a ranked list of pages, combining those with high hub values with others having high authority values so that the user can choose which type is best (Kleinberg suggests the best 5-10 hubs and the best 5-10 authorities).
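
Stages II and III of this simplified algorithm are straightforward to express in code. The sketch below is purely illustrative: it assumes that the base set and its links (steps 1 to 3) have already been collected, and uses a small invented network in which page H links to several topic-relevant pages and page A is linked to by several of them.

    # Simplified HITS iteration (steps 4 to 8) over a small hypothetical base set.
    links = {"H": ["A", "B", "C"], "G": ["A"], "F": ["A"]}  # source page -> link targets
    pages = {"A", "B", "C", "F", "G", "H"}

    auth = {p: 1.0 for p in pages}  # step 4: initial authority weights
    hub = {p: 1.0 for p in pages}   # step 4: initial hub weights

    for _ in range(20):             # step 8: around 20 iterations
        # step 5: authority weight = sum of the hub weights of pages linking to the page
        auth = {p: sum(hub[q] for q, ts in links.items() if p in ts) for p in pages}
        # step 6: hub weight = sum of the authority weights of pages the page links to
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # step 7: normalize by dividing by the highest value of each kind
        max_auth, max_hub = max(auth.values()), max(hub.values())
        auth = {p: v / max_auth for p, v in auth.items()}
        hub = {p: v / max_hub for p, v in hub.items()}

    print(max(auth, key=auth.get), max(hub, key=hub.get))  # 'A' is the top authority, 'H' the top hub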


Figure 12.7. Step 5. Voting with hub weights to authorities of linked pages.

Figure 12.8. Step 6. Voting with authority weights to hub weights of linking pages.


Figure 12.9. Step 7. Normalizing by dividing hubs by the highest hub value, 2, and authorities by the highest authority value, 1.


Figure 12.10. The results of one application of steps 5 to 7.

From Figure 12.10, the pages that link to other pages have the highest hub value, and the pages that are linked to have the highest authority value. A second iteration of steps 5 to 7 will change the hub and authority values, but the same pattern of high and low values will remain, as shown in Figure 12.11.

Figure 12.11. The results of a second application of steps 5 to 7.

Two further differences can be seen between PageRank and HITS, by inspection of the algorithm.

• Votes (steps 5 and 6) are not divided between link targets (or link sources); full votes are given to all linked pages.
• There is no loss of a percentage of votes, or reallocation of unused votes.

Finally, it is likely that good hubs will also become good authorities because they will get linked to for their hub value, as discussed in the PageRank gateway sites case study. For example, the 5th most highly linked to UK academic page in 2001 was a clickable map of links to all UK university home pages (Thelwall, 2002b).

SUMMARY: WEB SITE DESIGN FOR PAGERANK AND HITS

In this chapter, PageRank and HITS have been discussed separately, and Google has been discussed as if it is the only search engine operating anything like PageRank. Although ranking algorithms are kept secret, the reality is almost certainly more complex than this. Probably most search engines use some form of link based ranking, and adopt elements of the PageRank and HITS approaches, in combination with other sources of data. These other
sources will include the degree to which any particular query matches the text in the page, how often the target page is updated (i.e. the likely freshness of its information) and perhaps even the frequency with which users who type a given query click on links to the pages returned for the query. The following recommendations for the search engine visibility of sites are based mainly upon the analysis of PageRank and HITS, and repeat the basic link and text recommendations given in the first few sections.

• Web sites should be designed to be fully crawlable by a search engine starting at the home page and only able to find standard HTML links.

• Web pages should contain words in the text of their HTML that are relevant to their content, and particularly words that potential visitors may use in their searches.

• Web sites should be created at the earliest possible opportunity. Age is important in the indexing and ranking of web pages. PageRank is biased against newer pages (Baeza-Yates, Castillo & Saint-Jean, 2004), and newer pages are less likely to be found by search engines because they are less likely to have site inlinks (Vaughan & Thelwall, 2004).

• Web site URLs should be kept as stable as possible. Changed URLs can result in broken links, which will mean that visitors are lost by being unable to follow the broken link, and PageRank will be lost from the missing page.

• Web sites should try to get linked to by other sites, particularly popular ones such as Yahoo!. For example, including free useful content on a site is a logical way of giving an incentive for others to link to the site. Inlinks to the site will help to generate a higher PageRank value.

• Web sites should try to get linked to by other sites relating to the same topic. Relevant links are more useful for improving topic-specific ranking, as with HITS.

• No attempt to spam search engines should be made. For example, creating artificial web pages just to host links to the site, or creating large numbers of links in other ways, risks the site being banned completely from search engine indexes.

FURTHER READING

The best place to look for up-to-date information about how search engines work and how to ensure that web sites are designed with search engines in mind is searchenginewatch.com.

For those interested in the computer science of the link algorithms, modifications to PageRank have been suggested by information retrieval researchers, for example restricting the ranking calculations to topic-specific pages (Richardson & Domingos, 2001) and the modification of competitive algorithms to have PageRank-like qualities (Ng et al., 2001).

Values related to Google's PageRank can be obtained by installing the Google Toolbar from its Web site, but note that these are not raw PageRank figures. Visiting various sites to test their PageRank-related values in the toolbar is a useful exercise.

The PageRank part of this chapter is a modified, corrected and expanded version of a previously published article (Thelwall, 2002a), and the appendix is from Price & Thelwall (2004).


APPENDIX: THE VECTOR SPACE MODEL

The mathematics of the vector space model (VSM) is described here for completeness. The vector space model is a standard information retrieval approach for document representation and for use in document relevance ranking. As mentioned above, each document is represented by a list of its words and their frequencies. Clearly, much information is lost because the order of the words is not recorded. For example, if the words "New" and "Mexico" occur in a document then it is much more likely that the document relates to New Mexico if the words are known to be consecutive. Nevertheless, the vector space model is a useful representation of documents because it allows efficient searching and clustering.

The first step in applying the VSM to a set of documents is to construct a vocabulary, a list of all words found in any of the documents. The individual documents are then converted to word frequency vectors (simple lists of numbers) by recording the frequency of each of the words in the vocabulary. Normally, any document will only contain a small minority of the words in the vocabulary and therefore most of the word frequencies will be zeros. The VSM exploits word frequency information in order to generate weights for all of the words in a document. These estimate the relevance of each word to the document. A mathematical formulation follows.

Let X be a set of $n$ documents. Let $m$ denote the total number of unique words, and let $n_i$ be the number of documents containing word $i$. Let $f_{ij}$ be the frequency of word $i$ in document $j$, with $f_{j\max}$ being the maximum frequency of any word in document $j$.

The standard VSM weighting for a word $i$ in document $j$ is
$$w_{ij} = \frac{f_{ij}}{f_{j\max}} \log\frac{n}{n_i}.$$
Thus, words in a document are weighted highly if they occur in few documents (i.e., $n_i$ is low, so $n/n_i$ is high and hence $\log(n/n_i)$ is high) and have a high frequency relative to other words in the document (i.e., $f_{ij}/f_{j\max}$ is high). The VSM represents documents by word frequency weight vectors, so that document $j$ is represented by the vector $(w_{ij})_{i=1\ldots m}$. The distance between two documents is commonly measured with the cosine measure, defined below for documents $j$ and $j'$ (Baeza-Yates & Ribeiro-Neto, 1999).

$$\mathrm{sim}(j,j') = \frac{\sum_{i=1}^{m} w_{ij}\,w_{ij'}}{\sqrt{\sum_{i=1}^{m} w_{ij}^2}\;\sqrt{\sum_{i=1}^{m} w_{ij'}^2}}$$

The cosine measure gives values between 0 and 1, with documents that contain similar words tending to have a similarity close to 1, and documents with few words in common tending to have a similarity close to 0. One of the documents can be replaced by a search engine query, so that the cosine measure assesses the distance between the query and the document. This simple measure could then be used to rank documents by how 'close' they are to the user's query. This is in contrast to a Boolean search, which matches all documents that contain the user's query. If a user conducts a search for 'introduction to cybermetrics' in a Boolean
library catalog then the catalog will return a list of books containing the words 'introduction' and 'cybermetrics' in their title, ignoring the common word 'to'. The matches will not be ranked in order of relevance, but will probably be returned in alphabetical order or publication date order. In contrast, a web search for 'introduction to cybermetrics' sent to a search engine using the VSM would return a list of web pages containing 'introduction' and 'cybermetrics' (and possibly 'to') but would rank them in order of how relevant they were to the query. Web pages containing more occurrences of 'introduction' and 'cybermetrics' (compared to the highest frequency word in the page, $f_{j\max}$) would be judged more relevant, and the number of occurrences of the word 'cybermetrics' would have a greater impact than the number of occurrences of the word 'introduction' because the latter word occurs in many more web pages.
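
A minimal sketch of these calculations is given below. The tiny document collection and query are invented purely for illustration, and the query is weighted in the same way as the documents, which, as noted next, is a simplification.

    # VSM weights w_ij = (f_ij / f_jmax) * log(n / n_i) and the cosine measure.
    import math
    from collections import Counter

    docs = ["introduction to cybermetrics",
            "an introduction to physics",
            "introduction to chemistry",
            "cybermetrics and link analysis"]
    query = "introduction to cybermetrics"

    term_counts = [Counter(d.split()) for d in docs]
    n = len(docs)  # number of documents
    vocab = sorted({w for c in term_counts for w in c})
    n_i = {w: sum(1 for c in term_counts if w in c) for w in vocab}  # documents containing word i

    def weights(counts):
        f_max = max(counts.values())
        return [counts.get(w, 0) / f_max * math.log(n / n_i[w]) for w in vocab]

    def cosine(u, v):
        numerator = sum(a * b for a, b in zip(u, v))
        denominator = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return numerator / denominator if denominator else 0.0

    q = weights(Counter(query.split()))
    for doc, counts in zip(docs, term_counts):
        print(round(cosine(q, weights(counts)), 2), doc)

In this toy collection 'introduction' occurs in more documents than 'cybermetrics', so it receives a lower weight and contributes less to the ranking, echoing the point made above.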

In practice, a different weighting scheme is normally used for queries than for documents (Baeza-Yates & Ribeiro-Neto, 1999), and more sophisticated versions of the cosine measure are used, perhaps incorporating links and the position of text in the document. This ranking is used in step 1 of HITS, and is used by Google in conjunction with PageRank.

REFERENCES

Baeza-Yates, R., Castillo, C., & Saint-Jean, P. (2004). Web dynamics, structure and page quality. In: M. Levene & A. Poulovassilis (Eds.), Web dynamics. Berlin: Springer (pp. 93-109).

Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern information retrieval. Wokingham, UK: Addison-Wesley.

Bharat, K. & Mihaila, G.A. (2001). When experts agree: Using non-affiliated experts to rank popular topics. Tenth International World Wide Web Conference. Available: http://www.www10.org/cdrom/papers/474/index.html

Brin, S. & Page, L. (1998). The anatomy of a large scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117. Available: http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.

Google (2004a). Our search: Google technology. Available: http://www.google.com/technology/index.html

Google (2004b). PageRank information. Available: http://www.google.com/webmasters/4.html

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment, Journal of the ACM, 46(5), 604-632.

Lifantsev, M. (2000). Voting model for ranking web pages. In: Graham, P. & Maheswaran, M. (eds), Proceedings of the International Conference on Internet Computing, Las Vegas, Nevada, USA, CSREA Press, pp. 143-148.

Ng, A.Y., Zheng, A.X. & Jordan, M.I. (2001). Stable algorithms for link analysis. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), New York: ACM Press, pp. 258-266.

Nielsen, J. (2001). Designing Web Usability: The Practice of Simplicity, New Riders.


Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Available: http://dbpubs.stanford.edu:8090/pub/1999-66

Price, E. & Thelwall, M. (2004). The clustering power of low frequency words in academic webs. University of Wolverhampton.

Rafiei, D. & Mendelzon, A.O. (2000). What is this page known for? Computing web page reputations, Computer Networks, 33(1-6), 823-835.

Richardson, M. & Domingos, P. (2001). The intelligent surfer: Probabilistic combination of link and content information in PageRank. Poster at Neural Information Processing Systems: Natural and Synthetic 2001. Available: http://www.cs.washington.edu/homes/mattr/doc/NIPS2001/qd-pagerank.pdf

Salton, G. & McGill, J. (1983). An introduction to modern information retrieval. New York: McGraw-Hill.

Sullivan, D. (2001). How search engines rank web pages. Available: http://www.searchenginewatch.com/webmasters/rank.html

Teoma.com (2004). Adding a new dimension to search: The Teoma difference is authority. Available: http://sp.teoma.com/docs/teoma/about/searchwithauthority.html

Thelwall, M. (2000). Commercial Web sites: Lost in cyberspace?, Internet Research: Electronic Networking and Applications, 10(2), 150-159.

Thelwall, M. (2002a). Subject gateway sites and search engine ranking, Online Information Review, 26(2), 101-107.

Thelwall, M. (2002b). The top 100 linked pages on UK university Web sites: high inlink counts are not usually directly associated with quality scholarly content, Journal of Information Science, 28(6), 485-493.

Vaughan, L. & Thelwall, M. (2004). Search engine coverage bias: evidence and possible causes. Information Processing & Management, 40(4), 693-707.



13

A HEALTH CHECK FOR SPANISH UNIVERSITIES

OBJECTIVE

• To present a case study investigation into the university web sites of a single country.

INTRODUCTION

This chapter is a case study health check of the Spanish academic web, using an investigation into 64 university web sites crawled by SocSciBot. It is vital for the health of the research base in any country that its academics are able to make effective use of the web, especially because of the significance of international collaboration in research. It is clear, however, that in some countries the web is comparatively underused, and so a health check of national academic web use is a logical step for any government. Analyzing a national academic web also gives the opportunity to extract information about patterns of informal scholarly communication by tracking the targets of Web links. How international is the perspective of academic web page authors? What kind of resources do they target? Which countries have the closest connections?

This chapter is based upon a paper previously published in Spanish (Thelwall & Aguillo, 2003).

RESEARCH QUESTIONS

A key issue is how Spanish university web use compares to that of other countries.

• Are the sizes of Spanish university web sites broadly similar to those of other advanced nations?

The following questions are of a more exploratory nature.


• Are web links spread evenly between university web sites or do a few attract a disproportionate share?

• Which countries and web top-level domains attract most links from Spanish universities?

• What type of pages are the most highly targeted by links from Spanish universities?

Finally, from a theoretical perspective, the following questions are interesting.

• Are the ADMs useful to analyze links in the Spanish academic web?
• Is there a linear relationship between the number of links to a university and the size of its web site?

METHODS

A list of Spanish university web sites was obtained from an international directory (http://geowww.uibk.ac.at/univ/world.html) and then these were all crawled by a version of SocSciBot. In some cases the domain names were incorrect or had changed and the necessary corrections were manually identified. The coverage of the crawler in each case was all pages with domain names ending in any of the known official university canonical domains. For example, the University of Salamanca (www.usal.es) domains included many different ones although all ended in .usal.es. During the crawling process, the software was monitored in an attempt to avoid downloading "mirror sites", collections of pages copied from another location. In addition, web-based email archives and discussion lists were also avoided, when identified. The rationale for these steps is that the primary concern of the study is web pages created by the host institution. In the nature of the task, however, with millions of pages processed and no easy failsafe method of identifying unwanted areas, only the larger collections of pages would have been spotted during the crawling process. Once it was complete, the link databases were analyzed again using a set of software to identify likely unwanted collections of pages, which were then manually vetted and eliminated. This second stage led to a 20% reduction in page counts. Typical offenders were copies of multiple-page online manuals for programming languages or other software.

The link databases produced by SocSciBot were processed by the associated software suite in order to extract summary statistics (>chapters 18, 20). The database used is available at http://cybermetrics.wlv.ac.uk/database/.

RESULTS AND DISCUSSION

As can be seen from Figure 13.1, site sizes varied greatly and there were many universities with very small web sites. The largest was the University of Valencia's web site. Table 13.1 gives some basic statistics, set alongside similar figures extracted from comparable countries for which databases are available from the same crawler (http://cybermetrics.wlv.ac.uk/database/). It is difficult to make direct comparisons between countries because of their sizes and the variety of higher education systems. Of particular
concern for averaging statistics, one country may have many small universities, whereas another may have a few large ones. To compensate for this, total web size figures are presented normalized for population size. The Spanish web seems to lag behind the other countries on a per head of population basis, and significantly behind Australia.
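
The normalization in the final row of Table 13.1 is a simple per-capita index; the short sketch below reproduces it from the page and population totals given in that table.

    # Per-capita index used in Table 13.1 (Spain = 100).
    pages = {"Spain": 2123943, "Australia": 2726475, "UK": 5438941,
             "Taiwan": 2150690, "NZ": 315142}
    population = {"Spain": 40077100, "Australia": 19546792, "UK": 59778002,
                  "Taiwan": 22548009, "NZ": 3908037}
    spain_rate = pages["Spain"] / population["Spain"]
    for country in pages:
        index = 100 * (pages[country] / population[country]) / spain_rate
        print(country, round(index))  # reproduces the last row of Table 13.1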

Figure 13.1. The range of university web site sizes for the 64 Spanish universities crawled.

Table 13.1. Web statistics for selected countries. Population figures came from the CIA World Factbook (http://www.odci.gov/cia/publications/factbook/).

                                  Spain        Australia    UK           Taiwan       NZ
                                  2002         2001         2002         2002         2002
Total pages                       2,123,943    2,726,475    5,438,941    2,150,690    315,142
Median pages per site             25,918       64,153       29,757       20,738       38,082
Largest site size                 135,177      251,367      367,140      280,530      73,044
Universities crawled              64           38           111          46           8
Population                        40,077,100   19,546,792   59,778,002   22,548,009   3,908,037
Web pages/population (Spain=100)  100          263          172          180          152

From Figure 13.2, there is a wide range of different inlink counts for Spanish universities, but the spread is less than that for web site sizes (Figure 13.1). If a similar pattern were to be present for Spain as for the UK (chapter 8) then the number of inlinks per university would be roughly proportional to its total research productivity.


Figure 13.2. The range of university web site inlink counts for the 64 Spanish universities.

The relationship between site size and inlink counts is illustrated in Figure 13.3. From this it can be seen that the larger web sites tend to attract more links, consistent with the model in chapter 8.

Figure 13.3. Site inlinks against site page counts.

In the UK, counting links between domains was found to give more reliable results (chapter 3) and it can be seen that the results for the domain model (Figure 13.5) are less scattered than those for the file model (Figure 13.3). This is despite the smaller numbers involved for the former model, which should result in more scattering, other factors being equal. This is an argument for the improved effectiveness of the domain model.

The broadly linear trends in figures 13.3 to 13.5 are consistent with web site size being the main factor in attracting links from other Spanish universities. The outliers on the graph are cases where this rule may be violated. A possible explanation is that large individual
collections of pages may be artificially inflating site sizes. This is supported by the more linear trend in domain ADM inlinks shown in Figure 13.5.

The outlier on the right-hand side of the directory graph (Figure 13.4) is the University of Las Palmas in Gran Canaria, with 20,414 directories and only 1,019 directory-based inlinks. This was due to an enormous multi-directory educational database "Proyecto Docente". There is also an outlier on the vertical axis of the domain graph, the University of Girona, which is apparently based upon a single domain but attracts 577 domain-based inlinks. This is the result of an error in the crawling: the main udg.edu domain was crawled but all the subdomains end in udg.es, a fact that was spotted too late for inclusion in the crawl, but was factored in to the calculation of inlinks.

Figure 13.4. The number of unique directory-based site inlinks against the number of unique directories for Spanish universities.

Figure 13.5. The number of unique domain-based site inlinks against the number of unique domains for Spanish universities.

The Top Level Domains (TLDs) of web sites linked to by site outlinks in the crawled Spanish university web sites are summarized in Figure 13.6. The results are for the page model but the
domain model results (not shown) did not differ by more than 2% in any category, and so the numbers appear to be quite robust. Note that all links between Spanish university web sites are excluded from these figures. From Figure 13.6 the importance of the .com domain for Spain is evident. It attracts more links than even the official Spanish .es domain. Mainly American universities (.edu) and the multiple use .org domain also attract many links. It is interesting that the country domains represented are European or English speaking heavy web using countries. None are Spanish speaking. This spread of link targets represents a very healthy wide international focus. The absence of Spanish speaking countries could be evidence for the multilingualism of Spanish web authors or for the lack of web publishing in Spanish speaking countries.

Figure 13.6. The most commonly targeted TLD domains by inter-site links originating at one of the 64 university sites.

Table 13.2 gives information about the pages that attracted most links from Spanish universities. This list differs from a similar one from the UK, which only considered national links (Thelwall, 2002) although there is one UK page that appears in both. Both lists also contain a page of links to all universities within the respective countries.


Table 13.2. A summary of the top 25 link targets for Spanish university site outlinks.

Inlinks  URL ending                                                 Page type and reason for inclusion in the list
10713    www.mcu.es/bases/spa/isbn/ISBN.html                        Spanish ISBN agency. Linked to by many library pages for its ISBN search facility.
4937     www.google.com                                             Search engine link included on many pages, some automatically, but also widely used.
3335     www.zope.org/credits                                       Zope is a leading open source application server. Automatic credit on a large collection of pages.
3274     squishdot.org                                              Squishdot: The Open Source Discussion Forum Software for Zope. Many automatic credit links.
2835     www.celcat.com/webpub.html                                 The web interface for a timetabling application. Many automatic credit links.
2509     www.mfom.es                                                Ministry of public works and the economy. Linked to by catalogue of good Spanish practices.
2087     cubic.bioc.columbia.edu/eva                                A protein structure mirror site but the Spanish host university is part of the project.
2026     www.elpais.es                                              Automatically included on collections of pages, but widely used.
1713     www.csic.es                                                National scientific research organization.
1422     www.boe.es                                                 Government reports.
1293     www.yahoo.com                                              Search engine link included on many pages, some automatically, but also widely used.
1255     www.mfom.es/vivienda                                       Linked to by catalogue of good Spanish practices. The Main directorate of Home, Architecture and Urbanism of the Ministry of Public Works and the Economy.
1185     www.apple.es                                               Automatically included on collections of pages, but widely used.
943      cbl.leeds.ac.uk/nikos/personal.html                        Linked to by many web pages that were converted by the software described.
902      sl2.sitemeter.com/stats.asp?site=sl2apardo                 Web site counter, linked to by many pages on a site.
833      www.bestpractices.org                                      Linked to by Catalogue of good Spanish practices.
798      www.lotus.com                                              IBM Lotus Software.
793      www.abc.es                                                 News media organization.
788      www-dsed.llnl.gov/files/programs/unix/latex2html/manual    Linked to by many Web pages converted by the LaTeX2HTML software.
735      www.rediris.es/recursos/centros/univ.es.html               Complete list of Spanish university web sites.
648      www.mcyt.es                                                Ministry of science and technology.
617      validator.w3.org/check/referer                             HTML validation link.
592      www.uab.es                                                 Universitat Autonoma de Barcelona.
587      www.bne.es                                                 National library of Spain.
578      www.mec.es                                                 Ministry of Education and Culture.


Pages that are the target of automatically generated links dominate Table 13.2. It would be more useful if the most highly targeted pages were there as a result of multiple links from various different sources, rather than as an almost accidental by-product of the web page design decision of an individual. Despite this, it is interesting to see several government ministries represented as well as national resources such as the National library of Spain. Computing and press home pages also feature, probably mainly due to automatically created links.

CONCLUSION

Spanish educational use of the web appears to be in a reasonably healthy state, both in terms of web site sizes and interlinking between universities. The international multilingualism of web link targets suggests an excellent external orientation and broad vision for web authors. Nevertheless, web use appears to lag behind similar countries worldwide, and an alternative explanation for the international multilingualism of link targets may be the lack of sufficient local and Spanish language web content.

One unusual feature of the results, not identified in previous research (but see Leydesdorff & Curran, 2000), is that several government ministries were identified in the set of top 25 link targets. It would be interesting to see whether this phenomenon would be replicated in studies of other countries or whether the close government-education connection is a Spanish phenomenon.

This study has produced a range of statistics that shed some light on the Spanish academic web. Perhaps the most useful were those comparing Spain with other countries (Table 13.1) but the overall set of statistics, or indicators, produced gives a good overall impression of linking in Spain. This is an argument for the utility of web indicator creation as a tool to support science policy. It would be interesting to see the results of a large-scale international study that compared a range of different countries with a set of relevant web-based statistics and explorations.

From a theoretical perspective, the results are also useful to compare with those of the UK-based studies that predominate in the remainder of this book.

REFERENCES

Leydesdorff, L. & Curran, M. (2000). Mapping university-industry-government relations on the Internet: the construction of indicators for a knowledge-based economy, Cybermetrics, 4. Available: http://www.cindoc.csic.es/cybermetrics/articles/v4i1p2.html

Thelwall, M., & Aguillo, I. (2003). La salud de las Web universitarias espanolas. Revista Espanola de Documentacion Cientifica, 26(3), 291-305.

Thelwall, M. (2002). The top 100 linked pages on UK university Web sites: high inlink counts are not usually directly associated with quality scholarly content, Journal of Information Science, 28(6), 485-493.


14

PERSONAL WEB PAGES LINKING TO UNIVERSITIES

OBJECTIVES

• To present a case study of web link analysis to investigate relationships between universities and non-university web sites.

• To discover whether personal web pages that link to universities can yield information about the wider dissemination of research.

INTRODUCTION

This chapter reprints a previously published research paper (Thelwall & Harries, 2004b) to give a complete case study of an academic link analysis investigation. It has been modified principally to avoid repeating points made elsewhere in the book.

This case study investigates whether personal web sites outside of universities can yield new information about the wider dissemination and application of academic research. This is important because the public pays for universities in many countries and assesses them, through their governments. Moreover, public support can be critical to the funding of large-scale research projects (Gibbons, 1999) and, for smaller-scale projects, researchers can use evidence of public interest as a partial justification for the value of their funding bids. As a result, both university and government can gain from understanding the processes through which the public can make a significant engagement with science, and from developing methods to explore and assess the extent of public interest in individual areas of research.

Various information sources are currently used to illuminate public perceptions of research. For example, statistics concerning the rise of the popular science book genre are taken as evidence of a recent increase in interest in certain types of science (Weigold, 2001). Another information source, the subject of an entire research area, is science writing by professional journalists (Weigold, 2001). Measures of the extent of science coverage in newspapers would give information about specialisms attracting the most interest, albeit through the possibly distorting mirror of science writers' and editors' perceptions (or shaping) of their public's needs. Outside of this research area, media mentions have been used in conjunction with web invocations as evidence for the public fame of intellectuals (Landes &
Posner, 2000). Both of these sources (book sales and media coverage) have a major drawback for any detailed analysis that the web does not: the public involved is typically anonymous. In order to get information about science readerships, a consumer survey would be needed, even to obtain basic facts such as the proportion of the readership that are active researchers. Usenet newsgroups can give useful information about public participation in various diverse topics (e.g. Bar-Ilan, 1997; Fredrick, 1999; Hearit, 1999; Hine, 2000; Kot, Silverman & Berg, 2003; Stubbs, 1999) but their topic-specific nature and anonymity of typical message posters make them a problematic source of information for general science issues and when the constituency is likely to be a minority of the active participants for a topic - presumably academic newsgroups are dominated by academics and students, rather than the general public. The same may be true of personal web pages, but the greater information available gives the potential to differentiate between publishing by the general public and by scholars.

Given that the web is potentially an information source that can reveal new facets of public attitudes to academic research, perhaps including minority discourses and grassroots movements that are sidelined in the mainstream media (cf. politics: McCaughey & Ayres, 2003), how likely is it that online evidence can be found in significant quantities? Two separate positive indications are recognitions of the importance of the web in research dissemination (e.g., Lederbogen & Trebbe, 2003; Sloan, 2001; Sloan, 2002; Treise, Walsh-Childers, Weigold, & Friedman, 2003), and evidence of individuals' participation in individual (scientific and non-scientific) high profile public issues through the Internet (Bar-Ilan, 1997; Hearit, 1999; Hine, 2000; Foot, Schneider, Dougherty et al., 2003; Garrido & Halavais, 2003), massive participation in the case of US politics (Bazerman, 2003). Nevertheless, since web use by academics is, and will continue to be, very varied (Kling & McKim, 2000), it is likely that non-academic science-related publishing will also be irregular. Despite this, there have been no large multi-issue studies of how the public engages with universities online, and so it is simply not known whether the web would be a good starting point for future researchers seeking to explore, or assess the impact of, any academic research that does not enjoy a particularly high profile. Individual case studies like those above do not address this question since the answers are likely to vary enormously on a case-by-case basis. A large-scale study of personal pages discussing academic topics is therefore an essential step to guide future researchers as to the plausibility of issue-based approaches across the spectrum of academia. In addition, if impact assessments are to be made using this new information source, then the standard techniques of bibliometrics need to be invoked to gauge the validity of the reporting statistics used.

WEB PUBLISHING AND PERSONAL HOME PAGES

A sociological perspective on web publishing, the "loose web" thesis, states that the web and Internet are unorganized and disparate so that caution should be exercised when attempting generalizations (Burnett & Marshall, 2002). Hine's (2000) solution is to understand the web in terms of very specific contexts of use, an ethnographic perspective. This is the anti-technological determinism message of Kling and McKim (2000), who emphasize that the use of new technology by scholars is highly contingent on the specific needs of the discipline or field in which they are working. This is now backed up by empirical evidence (Herring, 2001; Hammond & Bennett, 2002; Thelwall & Tang, 2003; Tang & Thelwall, 2003).


This study is concerned only with personal home pages outside of academic settings. It seems likely that personal home pages will epitomize the looseness of the web, being often created as a recreational activity (Papacharissi, 2002) and not subject to any kind of formal quality control. Dillon and Gushrowski (2000) argue, however, that the personal home page has at least become a stable genre type (genre being regularities of form and purpose, Andersen, 2001) with a set of common elements to which many creators will conform. This suggests an evolution of the personal home page genre because little evidence of commonalities had been found previously (Bates & Lu, 1997). Personal home pages typically contain relatively little personal information (Dominick, 1999) and are often produced for social involvement (Pruijt, 2002), entertainment, self-expression, to provide information (Papacharissi, 2002), or in an attempt to exert power or influence over others (Zinkhan, Conchar, Gupta & Geissler, 1999). The last three motivations suggest that personal home pages will often contain 'public opinion', probably often concerning politics, but perhaps also sometimes related to research or university activities. Nevertheless, personal home pages cannot be taken as a representative sample of public opinion on any topic, because both presentation and content will be significantly influenced by a range of factors such as computing skills (Garrett, Lundgren, & Nantz, 2000), gender (Dominick, 1999; Miller & Arnold, 2001) and probably also economic class and education level. In addition to public opinion, university research can also be consumed by citizens as part of their daily lives. This probably occurs in support of recreational activities (e.g. astronomy, media studies) but in reality the extent of public interest in academic research, and the contexts in which it is found useful, is unknown. The regular appearance of academic experts in news reporting (e.g., Landes & Posner, 2000), suggests that there is some potential for engagement with the public, however.

The very diversity/looseness of the web makes it very difficult to theorize effectively and it is particularly problematic for the dominant constructivist paradigm of sociological research: arguing from the particular to the general (Tashakkori & Teddlie, 1998). This is also the underlying reason for the incompleteness of information science academic web research. Macro level studies have shown that universities that conduct more research attract significantly more links, but a detailed analysis suggested that, in general, better researchers attract more links because they publish more online, rather than because what they publish is of higher quality or more visible for other reasons (Thelwall & Harries, 2004a). Moreover, only about 1% of inter-university links target content equivalent to that of a journal article, although about 90% seem to link to pages with some academic nature (rather than administrative or recreational pages) (Wilkinson, Harries, Thelwall, & Price, 2003). As a result, if link counts measure online 'impact' (Ingwersen, 1998) then it is difficult to interpret 'impact' concretely. Nevertheless, the university interlinking studies provide a model for dealing with the complexity of the web, combining numerical macro-level analyses with micro-level examinations of individual pages in order to gain new insights. Purely large-scale analyses run the risk of missing the enormous diversity of the web and falsely assuming cause and effect relationships that the data might suggest. Conversely, micro-level studies can yield a partial and unrepresentative picture, missing the wider trends. A combination of the two is therefore appropriate for initial exploratory studies of widespread web phenomena.

RESEARCH QUESTIONS

A sample of personal pages was investigated to see whether this new easily accessible data source can be used to give evidence of the wider dissemination of research and also to see whether it can give new insights into academic-related web linking in general. The main
questions addressed are given below, but as this is an exploratory analysis, any other potential applications that may be suggested by the data will also be sought.

1) Do counts of links to university web sites from personal home pages correlate with the research impact of the target institutions? (main quantitative issue)

2) Do counts of links to university web sites from personal home pages correlate with counts of links to university web sites from other university web sites? (secondary quantitative issue)

3) Can an analysis of individual non-academic personal home pages give useful information about public perceptions of and uses for academic research? (the main qualitative issue)

The first research question is an artificial one designed to identify evidence of non-randomness in the personal home page data set. Using the Shannon and Weaver (1963) information theory, information is a manifestation of non-random behavior and therefore significant correlations point to the potential for useful information. More concretely, correlation with an existing source of information is a standard technique that is used in bibliometrics to evaluate new data sources of uncertain value (Oppenheim, 2000). If personal home page links correlate with research impact then this helps to model the kind of information that they represent. Just conceivably, at the university level, counts of links from personal home pages could be used to assess the public interest value of the research conducted at a university (one logical extension of the correlation tests), but much more evidence than just statistical data would be needed for this (again see Oppenheim, 2000). The purpose of the second research question is to assess the extent to which, on a macro level, link attractiveness, as measured by inlink counts, is specific to the origin of the source pages. A close relationship would indicate that link attractiveness was a property of the target site, whereas a weak one would indicate that it was very context dependent.

METHODS

A mixed model methodology was adopted (Tashakkori & Teddlie, 1998): combining both qualitative and quantitative techniques in order to identify any relationship between personal home pages that link to universities and those universities. Statistical methods were used to look for large-scale trends and a simultaneous qualitative investigation to look for explanations of the phenomena found.

Data collection

The scope of the study was personal web sites in the UK outside of universities that link to UK academic institutions. Private Internet Service Providers (ISPs) in the UK typically allow subscribers to post web pages on their servers, and these were the logical sources for samples of personal pages and were used as a convenient operationalization of the personal home page concept (cf. Dominick, 1999; Papacharissi, 2002), albeit one that does not significantly engage with the concept of genre: some personal spaces will contain pages that would not be considered to be part of the personal home page genre. There was no complete directory of these pages (although there are listings of ISPs) and so AltaVista's advanced search interface
was used to help find them. The query below was entered to identify web pages not in the UK or US academic domains that linked to the UK academic domain and contained the word "personal". This was a heuristic with the purpose of identifying the major hosts of such pages.

personal AND link:ac.uk AND NOT host:ac.uk AND NOT host:edu

The resulting thousand pages included many from overseas universities and other irrelevant sources but the list was processed manually in order to keep only the pages hosted by UK ISPs. This produced seven ISPs, each having a uniform naming strategy for personal web space URLs. This represents a minority of UK ISPs: those with personal home pages best indexed by AltaVista and therefore most visible online in this sense, and a larger sample than used in previous research (e.g., Dominick, 1999; Papacharissi, 2002). Since different service providers may have different typical user profiles, the following extra research question was added.

4) (ISP bias test) Do the personal pages hosted by different ISPs have different tendencies to link to university web sites? (methodological quantitative issue)

New queries, shown below, were then submitted to AltaVista in order to find all the personal home pages from each ISP that linked to a UK university. All URLs matching the resulting queries were recorded. For some ISPs there were more matching URLs than AltaVista allows and so only the first 1000 could be recorded.

• AOL host:hometown.aol.co.uk and link:ac.uk
• Blueyonder host:pwp.blueyonder.co.uk and link:ac.uk
• BT Internet host:www.btinternet.com and link:ac.uk
• Demon host:demon.co.uk and link:ac.uk
• Freenet host:www.freenetpages.co.uk and link:ac.uk
• Freeserve host:freeserve.co.uk and link:ac.uk
• Virgin host:freespace.virgin.net and link:ac.uk

The third data collection stage was to download all of the personal web sites (see the Personal Home Page ADM section below for the scope of these) found by the AltaVista searches. This was conducted on February 18, 2003 using the freeware crawler WinHTTrack (www.httrack.com). A modified version of the Wolverhampton University webometric crawler SocSciBot (Thelwall, 2001b) was then used to extract the links from the AltaVista indexed pages (only) from the hard disk copies of the downloaded sites. The purpose of downloading entire sites rather than the indexed page alone was to retain extra information about page contexts to be used in the classification process, when required.

Data analysis

Statistical Tests

It has become standard webometric practice to correlate link counts with research ratings in order to ascertain whether a basic relationship could be present (Vaughan & Thelwall, 2003).
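
As a purely illustrative sketch, a test of this kind could be run as below. The link counts and research ratings are invented, and Spearman's rank correlation is used here only because link count data are typically highly skewed; the chapter does not prescribe a particular coefficient.

    # Hypothetical link counts and research ratings for five universities.
    from scipy.stats import spearmanr

    inlink_counts = [120, 85, 310, 40, 205]
    research_ratings = [4.1, 3.6, 5.2, 3.0, 4.8]
    rho, p_value = spearmanr(inlink_counts, research_ratings)
    print(rho, p_value)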


The correlation test is only one of several that should be applied to new data sources (Oppenheim, 2000) but it is a convenient first indication of a significant pattern in the data. One of the problems with applying statistical tests to link count data is that the necessary assumption of independence is violated because a site with many inlinks should attract a disproportionate share of new links (Barabasi & Albert, 1999; Huberman, 2001; Pennock, Flake, Lawrence et al., 2002). In other words, links influence each other. Fortunately, correlation tests are robust enough to accept some violations of their underlying assumptions (Howell, 2002). Nevertheless, steps were taken to minimize the degree of violation by employing the Alternative Document Model concept (<chapter 3). A new type of ADM was deployed, however, to cope with personal home sites because these employ URL structures of a type that have not previously been analyzed. It was called the home site ADM. Its definition and rationale follow.

The Home Site ADM

It was hypothesized that the links and pages created in a single user's ISP personal web space would be related to each other whereas those created by different users would be mainly unrelated, and so the essential distinction to be made was between user spaces (called 'home sites' here) rather than between pages or individual links. All links on the web potentially influence each other but the extent of the influence for unrelated links should typically be weak. As a result the independence of the link data source could be increased by taking steps to avoid counting two or more identical links from the same user space. The home site ADM document was therefore defined to be all pages in the space of an individual user. For some ISPs this coincided with the domain ADM because they gave all users a unique domain name (e.g. www.chriswillis.freeserve.co.uk). Other ISPs gave users a unique directory (e.g. freespace.virgin.net/alex.wong/) but this is not the same as the directory ADM because users are allowed to create subdirectories and these would be counted as different directory ADM documents but would be part of the same home site ADM document.
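
A minimal sketch of how URLs could be collapsed into home site ADM documents is given below, assuming the two naming patterns just described. The list of directory-style hosts and the function itself are illustrative assumptions rather than the software actually used in the study.

    # Collapse a page URL into its home site ADM document identifier.
    from urllib.parse import urlsplit

    DIRECTORY_STYLE_HOSTS = {"freespace.virgin.net", "www.freenetpages.co.uk"}  # assumed examples

    def home_site(url):
        parts = urlsplit(url)
        if parts.hostname in DIRECTORY_STYLE_HOSTS:
            # the user space is the first directory below the host
            path = parts.path.strip("/")
            top_dir = path.split("/")[0] if path else ""
            return parts.hostname + "/" + top_dir
        return parts.hostname  # the user space is a whole (sub)domain

    print(home_site("http://www.chriswillis.freeserve.co.uk/links/uni.html"))
    print(home_site("http://freespace.virgin.net/alex.wong/astro/links.html"))
    # Any page below alex.wong/, whatever its subdirectory, maps to the same document.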

For link targets the previously most reliable ADMs have been the directory and domain models (Thelwall, 2002a; Thelwall & Harries, 2003), although the hybrid range models have been slightly more successful (Thelwall & Wilkinson, 2004). The document model for link targets that would give the most independent variables is the university model, also known as the site model. This would allow each home site to count only one link to each university. The specific type of anomaly that this would eliminate is that where one home site links to many different pages of a single university, perhaps because of a close connection. To be concrete: a home site may link to both the home page of a university and the home page of its astronomy department in the same sentence with one underlying motive: to relate to astronomy information (e.g., Thelwall, 2003). The university ADM is therefore the most appropriate for link targets but this analytical conclusion was assessed against the data through the additional research question below. Note that the main model is a hybrid one, using a different ADM for link sources and targets.

5) (ADM fitting) Is there clear evidence in the data that the university ADM fits best, in the sense of reducing anomalies in the data? (a methodological quantitative issue)

The results of a range of statistical tests, and precise details of each one, are reported in the results section to avoid unnecessary replication of information.


Qualitative Analysis

Correlation statistics are valuable for identifying patterns but should not be used to infer a causal relationship. An analysis of the individual pages involved is a necessary step towards this (Oppenheim, 2000; Thelwall, 2001a). Since home pages linking to university sites have not been studied before there is no clear theoretical framework from which to start. This is compounded by the known complexity of the inter-university linking phenomenon from the three previous large-scale classification exercises (Wilkinson, Harries, Thelwall & Price, 2003; Thelwall & Harries, 2003; Thelwall, 2001a), which does not give a clear basis for extrapolation. The page content analysis will therefore be an exploratory one. The overall approach will be post coordinate clustering: visiting a random selection of pages and then devising categories for pages that appear to be similar. This exercise was conducted by the author and represents my perspective on the data, with all my preconceptions and prejudices. This follows established practice and is methodologically a predominantly constructivist approach which has the primary purpose of illuminating the data rather than providing robustly generalizable results. A formal content analysis (Krippendorff, 1980; Weare & Lin, 2000), which could have produced more robust results through independent classification and the computation of inter-classifier agreement statistics, was rejected as being too restrictive (McMillan, 2000) and therefore inappropriate here. Nevertheless, to guard against bias, another person, Gareth Harries, classified the same list based upon the author's categories and the results represent the consensus.

RESULTS

ISP bias test

Before running the main tests on the data set it would be useful to know whether any ISP had a significantly different user profile from the others, particularly with respect to academics. It was hypothesized that the older ISPs, such as Demon, would have a higher percentage of academic users than the newer ones simply because universities were early adopters of the Internet. If ISPs do have different user profiles then tests that combine them must be cautiously interpreted.

The test devised was a very approximate one. It was hypothesized that the proportion of all personal pages hosted by an ISP that link to a university would be a basic indicator of the overall academic orientation of its users. The results are shown in Table 14.1, using AltaVista advanced search commands for the data. This is a flawed test because some of the domains also host material that does not come from personal home pages and also the use of search engines as a data source precludes the use of ADMs. Nevertheless, the results are consistent with the hypothesis that the different ISPs have a different character with respect to page creators.
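As a simple illustration of the calculation behind Table 14.1 (the counts below are the AltaVista figures reported in the table), the proportion for each ISP is just the number of pages with an .ac.uk link divided by the total pages matched:

```python
# (total personal pages, pages with an ac.uk link) from the AltaVista queries
isp_counts = {"AOL": (3014, 36), "Blueyonder": (9910, 251),
              "BT Internet": (81367, 949), "Demon": (67937, 2957),
              "Freenet": (6210, 108), "Freeserve": (50692, 1208),
              "Virgin": (88023, 992)}

for isp, (total, linking) in isp_counts.items():
    print(f"{isp:12s} {100 * linking / total:4.1f}% of pages link to an .ac.uk site")
```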


Table 14.1. The proportion of ISP personal pages linking to an .ac.uk site (AltaVista data).

ISP            Total pages    Pages with ac.uk link    Percentage
AOL                  3,014                       36            1%
Blueyonder           9,910                      251            3%
BT Internet*        81,367                      949            1%
Demon*              67,937                    2,957            4%
Freenet              6,210                      108            2%
Freeserve*          50,692                    1,208            2%
Virgin              88,023                      992            1%

* The totals include some general pages in addition to personal pages.

ADM fitting

This is another preliminary test, which has the purpose of validating the ADM used. Although a new ADM has been created in the methods section, the home site ADM, and it was argued that the home site source - university target counting model is theoretically the most reliable in terms of producing the most statistically independent data, Table 14.2 presents the results of four ADMs for comparison purposes. The first two are the standard page source - page target and directory source - directory target ADMs, and the last one is the main ADM described above. The third uses the new home site ADM for the source of links and the domain ADM for targets. This is more appropriate than using the domain ADM for both source and target because some ISPs host all home pages on the same domain, so this would not give consistent results across ISPs. Note that this data excludes all pages that did not link to one of the 110 university web sites because they either linked to a different .ac.uk site, or the .ac.uk link had disappeared between AltaVista indexing the site and the research crawl fetching it.

The last four columns of Table 14.2 give the average number of links from the source document to distinct UK university target documents, using the ADMs as described above. For example, the first of the four columns is the average number of UK university web pages linked to per home page identified. In the absence of anomalies and assuming uniform user profiles, the ratios in each column of Table 14.2 should be approximately the same for all ISPs. The very high page outlinks ratio for Freenet is a clear indication of an anomaly. This was tracked down to one page that contained a huge number of links: a copy of an index page for the Linux Gazette, linking to over 800 pages at the University of Manchester. The targets of these links were not in fact created in Manchester but were mirrored (copied) from their origins at Vanderbilt University, USA. The directory model removed this anomaly, but beyond that there is no clear pattern of greater aggregation giving more similar results, as would be expected (Thelwall, 2004). This could be explained either by the bias between ISPs shown above masking the difference, or by a genuine lack of the kind of anomalies that necessitate the use of ADMs aggregating beyond the directory level. Table 14.2 therefore shows that the file model should not be used because of the identified anomaly that it would leave in the data, but does not clearly indicate any of the other three as best overall. The home site source - university target model (university outlinks per home site) should be the most reliable, however, because of the reasons given in the theoretical discussion section above.

Note that the "Pages" column figures in Table 14.2 are smaller than the notionallyequivalent "Pages with ac.uk link" column in Table 14.1. The reason is that the data source

Page 165: Link Analysis: An Information Science Approach (Library and Information Science) (Library and Information Science)

Personal Web Pages Linking to Universities 153

for Table 1 is AltaVista, whereas the data source for Table 14.2 is the WinHTTrack data. Thedifferences will be caused by a combination of reasons: links can be "lost" between AltaVistaand WinHTTTrack because pages have been changed or removed since AltaVista's last crawl,links were recorded in error by AltaVista, AltaVista did not report the page in its results listdespite counting it, the link was a mailto: address (counted by AltaVista but not in Table 14.2)or (for the larger ISPs) because of the restriction to 1000 results returned by AltaVista.

Table 14.2. Outlink counts to UK universities with different ADMs.

ISP            Pages  Directories  Home    Average page  Average directory  Average domain   Average university
                                   sites   outlinks      outlinks per       outlinks per     outlinks per
                                           per page      directory          home site        home site
AOL               22           15      15          1.41               1.73             1.67                1.47
Blueyonder       120           98      92          1.83               1.97             1.67                1.51
BT Internet      580          399     361          2.37               2.87             2.80                2.48
Demon            681          532     445          2.88               2.82             2.67                2.21
Freenet           76           41      30         12.25               3.54             2.03                1.83
Freeserve        646          512     464          2.11               2.36             2.21                1.95
Virgin           638          482     447          1.97               2.17             2.01                1.85
Total          2,763        2,079   1,854          2.58               2.53             2.35                2.06

Correlations between links and research ratings

This test addresses question 1. Link counts from personal home pages to the 110 UK university institutions were correlated with a measure of their research productivity. Following established practice (Wilkinson, Thelwall & Li, 2003) the research productivity of each university was estimated using the data supplied by the last national Research Assessment Exercise (RAE) (http://www.rae.ac.uk/) by adding up the RAE scores of academics submitted to the exercise by the university. Specific details of this can be found in a previous study (Thelwall, 2002a) but the key point is that the result of the calculation is a relatively authoritative estimate of the total research productivity of each university. Correlating total link counts with total research productivity could give misleading results for the UK, however, since institutional sizes vary greatly and size is a factor in total research productivity and potentially also for inlink counts: bigger universities could be expected to both attract more links and conduct more research, without there necessarily being a direct connection between the two. Hence size was factored out from both variables by dividing both by the total full-time equivalent faculty at the university.
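A minimal sketch of this step is given below (the arrays are invented placeholder values, not the study's data): both counts are divided by FTE faculty to factor out size, normality is checked, and Spearman's rho is computed with SciPy.

```python
import numpy as np
from scipy import stats

# Placeholder per-university figures: inlinks (home site source - university
# target ADM), RAE-derived research productivity, and full-time equivalent faculty.
inlinks = np.array([420, 130, 980, 260, 75])
research = np.array([1800.0, 450.0, 3900.0, 900.0, 300.0])
faculty = np.array([900, 350, 1700, 600, 250])

inlinks_per_faculty = inlinks / faculty      # factor institutional size out of both variables
research_per_faculty = research / faculty

# If a normality test fails, report the rank-based Spearman correlation rather than Pearson.
ks = stats.kstest(inlinks_per_faculty, "norm",
                  args=(inlinks_per_faculty.mean(), inlinks_per_faculty.std()))
rho, p = stats.spearmanr(inlinks_per_faculty, research_per_faculty)
print(ks)
print(f"Spearman's rho = {rho:.3f}, p = {p:.3f}")
```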

The data failed a Kolmogorov-Smirnov normality test and so Spearman correlations are reported instead of Pearson, the results appearing in Table 14.3. Although all correlations are highly statistically significant (p < 0.01), the last one is the most convincing for the reasons discussed above. Clearly research productivity is highly associated with link attraction from personal home sites. No causal connection is claimed, however, since a theoretical model of linking behavior has not yet been developed that would allow a hypothesis of this kind. The results of the main counting model are shown in Figure 14.1. Two alternative plausible interpretations of the graph are an approximately linear trend and a bipolar split into two groups: a low research productivity per faculty half (e.g. < 3) and a high research productivity per faculty half (e.g. > 3). Some of the low link attractiveness anomalies in the second half are discipline-specific institutions that operate in subject areas that generally attract few links. The School of Oriental and African Studies is a case in point: humanities researchers in universities seem to interlink only rarely (Thelwall & Tang, 2003) and so its low inlink count is not surprising. A linear trend with anomalies is therefore a likely explanation for the graph.

Table 14.3. Correlations between research productivity per faculty and inlinks found per faculty for UK universities.

Source ADM       Target ADM    Spearman's Rho
Page             Page                   0.693
Directory        Directory              0.741
Home site ADM    Domain                 0.754
Home site ADM    University             0.758

Figure 14.1. Inlinks per faculty against research productivity per faculty (home site source - university target ADM).

A comparison of university and home page link sources

This test addresses question 3. Links from home pages were compared with links from the universities themselves, using UK inter-university linking data from a publicly available source (http://cybermetrics.wlv.ac.uk). The purpose of this was to see whether the two were significantly different phenomena on a macroscopic level. The results are shown in Figure 14.2, comparing the main home site counting model with the domain counting model for inter-university links. The trend this time is very linear (Spearman's rho: 0.851, Pearson's r: 0.859), suggesting that there may be an underlying cause for both types of links, or possibly that one influences the other. The two counting models are not directly comparable; the home site source - domain target model is more directly comparable, although the data is less independent. An equivalent analysis was conducted for this and the results (not shown) were very similar; in fact the graph was slightly more linear in appearance.

Figure 14.2 is disappointing for one of the main issues of the paper: it suggests that personal home pages may not be able to give a new macroscopic perspective on universities. This is not a necessary conclusion, however, as an investigation of the contents of individual link pages may still yield a different set of creation motivations. It also suggests that link attractiveness may be a relatively robust construct at the university level, one that extends beyond the academic domain.

Figure 14.2. Inlinks from university pages against inlinks from personal home pages (PHP data) for UK universities.

Individual page categorizations

The classification exercise provides two types of information that will help to answer question 2 by suggesting interpretations of the numerical results and identifying case studies to understand the phenomenon in more depth.

Page Creators

A sample of 271 pages chosen at random from the complete set of 2,763 was visited in order to attempt to identify the relationship between the creator of the page and the university to which it was linked. Clearly if all of the pages were created by current faculty then it would be impossible to use them for evidence of public perceptions of universities. The results in Table 14.4 indicate that about a third of the personal home pages were created by people associated with those universities. The identification of the relationship was difficult. Although in some cases it was clearly stated that the page creator was a member of the university in question, in most no evidence was present and the categorization was inferred from the context of the link. An attribution of 'Not associated' was made when there was no reason to believe that a relationship existed. In many instances of uncertainty a web-based investigation was conducted to seek an answer. For example one page linked to chemistry information at a university and the departmental web site was searched for the name of the page author, finding him to be a lecturer there. Despite the care taken, the figures probably slightly overestimate the 'Not associated' category size since it is impossible to be certain that no affiliation exists.

Table 14.4. Attributed page creator.

Page creator               Frequency    Percentage
Current faculty                   49           18%
Former faculty                     2            1%
Other current employees            2            1%
Current students                  27           10%
Former students                   10            4%
Not associated                   181           67%

Attributed Link Creation Purposes

The second categorization exercise attributed a purpose for each link in the sample of 271. The results are shown in Table 14.5. The categories were derived from the classification process itself and the main ones are described below.

Table 14.5. Attributed purpose of link.

Purpose                    Frequency    Percentage
Recreational                      69           25%
Academic                          51           19%
Own site (self-link)              46           17%
Credit/acknowledgement            34           13%
Hosted site of others             22            8%
Geographic proximity              11            4%
Library/archive                   10            4%
Teaching                           8            3%
Advertise course                   5            2%
Other                             15            6%

• Links were classified as recreational if the target page contained information unrelated to the job of the page creator in the target site. This was in almost all cases a clear-cut decision. Links where the target page was not recreational but the source page was would not be classified as recreational. The reason for this is that this latter group could include applications of research in hobby activities. A quarter of links were created for recreational reasons, often links to sporting societies or clubs in universities, but also to individual informational pages, a higher percentage than the 6.5% for inter-university links (Wilkinson, Harries, Thelwall & Price, 2003).

• Academic links are the most important kind for this investigation: indicating the referencing of academic information of some kind. These are discussed further below.

Page 169: Link Analysis: An Information Science Approach (Library and Information Science) (Library and Information Science)

Personal Web Pages Linking to Universities 157

• Forty-six links were self-links: connecting to university pages created by the same author, or to the university home page. These mostly targeted personal or research pages. One exception was where the author referenced a departmental site as an example of their handiwork. This is the worst kind of link from the perspective of the study because it does not imply any kind of inter-personal communication.

• The credit/acknowledgement link class covers links to a university that do not have the purpose of referencing information but only to reinforce a connection (Thelwall, 2003). These were often in sentences of the form: I was a lecturer/student in the department of X at university Y.

• Twenty-two of the sites linked to were not actually university pages but were hosted by the university either as a mirror site or as a service to an external organization such as a national society. These are awkward links to interpret since it could be presumed that a strong relationship would be present between the university and the society in order to host its site.

• Eleven links were in pages of local information, listing their local university amongst other regional entities with web sites.

• Ten pages referenced a library or archive site in a university. An example of this was a page about the former British prime minister, Harold Macmillan (www.mdlg05075.pwp.blueyonder.co.uk/macmillan.htm) that contained a link to the Bodleian Library at Oxford University because it holds his personal papers.

• Eight pages gave teaching information for specific courses, presumably placed on a private ISP either because the facility to put pages online was not available at the author's university or simply because they had learned first how to put content online at home.

• Five pages linked to a university to advertise a course that it offered. For example one religious page linked to a university offering a relevant education.

Further Analysis of Academic Links

The links judged to be created for academic reasons but not by current or former university staff or students are the most relevant to the research questions of this paper and so a sample of these (the first 15 found) are individually listed below and then discussed in more detail.

• Interesting astronomy web sites link list (www.btinternet.com/_Dave.Eagle/sites.html)
• Pollen germination page: link to research info on site hosted by the Natural History Museum (www.btinternet.com/_jg/pollen2.html)
• Records and information management links from a consulting company (www.btinternet.com/_missenden_consulting/pages/useful_links.htm)
• Ozone page with a link to the Encyclopedia of the Atmospheric Environment (www.btinternet.com/_james.allen/topics/ozone.html)
• Spectrum programs page with a link to Spectrum compiler software (www.breezer.demon.co.uk/spec/tech/prog.html)
• Environmental company with a link to green resources page (www.frey.demon.co.uk/links.html)
• Biblical studies & home schooling links page linking to a page of internet resources for the study and teaching of theology (www.robibrad.demon.co.uk/bibstuds.htm)
• A National Pure Water Association water quality research site with a link to research information (www.npwa.freeserve.co.uk/H2O.html).


• An artist's page with a link to the 3rd International Conference on Design and Emotion 2002 (freespace.virgin.net/iain.irving/index.html).
• A personal page of an organ player linking to the National Pipe Organ Register at Cambridge (www.mnemonics.freeserve.co.uk/englandl.htm).
• The Rail News Snippets magazine with a link to a seminar on transport for Cambridge (freespace.virgin.net/martin.thorne/snippets/snippets_49.htm).
• A St. Cenydd School History Department link to math numeracy resources (freespace.virgin.net/martie.wales/Numeracy/Mathematics.htm).
• A weather page linking to a map of earthquakes (freespace.virgin.net/ez.sale/weather-centre/satellite.htm).
• An archaeological photographer's home page linking to Cambridge University Committee on Aerial Photography (freespace.virgin.net/paul.alice/links/flinks.html).
• A commercial scientist's online paper linking to a conference home page (freespace.virgin.net/j.foss/issls98.htm).

The majority of links are related to science, which would be consistent with the greater use of the web by scientists (Tang & Thelwall, 2003). Nevertheless there is a theology link and an art and design link. There were two further theology links in the sample of 271 but one was from a clergyman at the university and the other targeted a course page. This is interesting from an historical perspective since theology has been marginalized in UK universities in the modern era following medieval dominance.

Two of the links have a recreational flavor: the Spectrum compilers link page, which could easily have been classified as mainly recreational, and the astronomy page. For the owner of the latter link source page, astronomy is probably a hobby, although the page targeted is an academic one.

As a final point, the wide range of source page owners is noteworthy: a school; individuals in a self-employed or private context; businesses; and a public information group. It would clearly take a much larger scale classification exercise to try to extract a pattern in terms of the typical types of source page owners, and to relate these findings to the genre, gender and individual isolation issues identified in the literature review.

CONCLUSION

On a macro level, personal home page links to universities behave similarly to inter-university links in terms of both the correlation between these two data sources and also between personal home page links to universities and university research productivity (research questions 1 and 2). The first correlation suggests that link counts do measure a real property of university web sites that could perhaps be called general link attractiveness. Based upon previous research (Thelwall & Harries, 2004a), the general link attractiveness of a university site is determined primarily by its size, and university web site size is in turn determined by faculty research productivity (irrespective of research content). The macroscopic relationship is unlikely to be due to university members creating the personal home pages since this appears to account for only a third of such pages. The second correlation confirms that there is a relationship between a university's research and the total number of links that it receives from personal home pages, but the breakdown of attributed link motivations shows that the most sought type - links to academic content - accounts for only about 19% of all links. As a result, macroscopic level studies cannot be used to gain evidence of the public recognition of academic research. For this kind of information the link source pages would have to be carefully categorized first. Individual researchers may also want to find the same type of data from a different angle: retrieving from an advanced search engine query just the pages linking to their own site (Thelwall, 2002b). The page categorization exercise therefore showed that the relationship should not be interpreted as one of predominantly informal scholarly communication in the way that inter-university links can be (Wilkinson, Harries, Thelwall & Price, 2003). Possibly the public genuinely have little interest in online academic research. It is known that most scientists are unable to communicate effectively with the public (Hartz & Chappell, 1997), but if this is true then they may be missing out on the mass communication potential of the web to harness public interest as an extra reason for funding their research. Despite the generally low level of linking to academic research, it is likely that in some particularly high profile areas (e.g. astronomy, genetically modified crops) there will be enough links to yield data about facets of public use of university research.

Some interesting information was gained from the analysis of individual pages that link to academic research, showing the wide range of contexts in which academic research is linked. In addition, the classification exercise showed a much larger proportion of recreational links than previously found for inter-university links (Wilkinson, Thelwall & Li, 2003). University web sites clearly contain a lot of recreational information that is of value in the wider community. In many cases the information is related to the existence of a hobby-related team in the university. This is an untrumpeted aspect of a university: the visibility in the wider community of its recreational activities, at least online. There is also significant evidence of informal technology transfer from universities into the wider community in the form of web site creation and hosting. Universities have clearly helped a number of organizations to get a web site and enjoy the benefits of web publicity.

In conclusion, personal home pages can give new insights into the relationship between the public and universities, both in the form of the direct dissemination of academic research and through a range of less formal activities that can make a valuable contribution to the national infrastructure, particularly from a social perspective. In future research, if useful information about the wider dissemination of academic research is needed then PHP Alternative Document Models must be used and links must be classified in order to separate out the minority that are relevant to the issue.

META-CONCLUSIONS

This case study is an example of a web link analysis to investigate relationships between universities and non-university web sites. One of its strengths is its review and use of literature relevant to the specific topic of research, rather than just related to link analysis. A weakness is that much larger numbers of pages would need to be classified to give reasonable estimates of the proportion of pages in each category. The results are therefore only indicative and not statistically strong.

ACKNOWLEDGEMENT

First published in the Journal of Information Science 30(3) 2004 - Copyright CILIP 2004 - republished with permission (Thelwall & Harries, 2004b).


REFERENCES

Andersen, J. (2001). The concept of genre: When, how and why? Knowledge Organization, 28(4), 203-204.

Barabasi, A.L. & Albert, R. (1999). The emergence of scaling in random networks, Science, 286, 509-512.

Bar-Ilan, J. (1997). The 'Mad cow disease', Usenet newsgroups and bibliometric laws, Scientometrics, 39(1), 29-55.

Bates, M.J. & Lu, S. (1997). An exploratory profile of personal homepages: Content, design, metaphors, Online & CD ROM Review, 21(6), 331-340.

Bazerman, C. (2003). Genre and identity: Citizenship in the age of the Internet and the age of global capitalism. Available: http://www.education.ucsb.edu/~bazerman/gender.htm

Burnett, R. & Marshall, P. (2002). Web theory: An introduction (Routledge, London).

Dillon, A. & Gushrowski, B.A. (2000). Genres and the web: Is the personal home page the first uniquely digital genre? Journal of the American Society for Information Science, 51(2), 202-205.

Dominick, J.R. (1999). Who do you think you are? Personal home pages and self-presentation on the World Wide Web. Journalism & Mass Communication Quarterly, 76(4), 646-658.

Foot, K., Schneider, S., Dougherty, M., Xenos, M. & Larsen, E. (2003). Analyzing linking practices: Candidate sites in the 2002 US electoral web sphere, Journal of Computer Mediated Communication, 8(4). Available: http://www.ascusc.org/jcmc/vol8/issue4/foot.html

Fredrick, C. (1999). Feminist rhetoric in cyberspace: The ethos of feminist Usenet newsgroups, Information Society, 15(3), 187-197.

Garrett, N.A., Lundgren, T.D. & Nantz, K.S. (2000). Faculty course use of the Internet, Journal of Computer Information Systems, 41(1), 79-83.

Garrido, M. & Halavais, A. (2003). Mapping networks of support for the Zapatista movement: Applying social network analysis to study contemporary social movements. In: M. McCaughey & M. Ayers (eds). Cyberactivism: online activism in theory and practice (Routledge, New York, pp. 165-184).

Gibbons, M. (1999). Science's new social contract with society, Nature, 402, C81-C84.

Hammond, N. & Bennett, C. (2002). Discipline differences in role and use of ICT to support group-based learning, Journal of Computer Assisted Learning, 18(1), 55-63.

Hartz, J. & Chappell, R. (1997). Worlds apart: How the distance between science and journalism threatens America's future, The First Amendment Center, Nashville, TN.

Hearit, K.M. (1999). Newsgroups, activist publics, and corporate apologia: The case of Intel and its Pentium chip, Public Relations Review, 25(3), 291-308.

Herring, S.D. (2001). Using the World Wide Web for research: Are faculty satisfied? Journal of Academic Librarianship, 27(3), 213-219.

Hine, C. (2000). Virtual Ethnography (Sage, London).

Howell, D.C. (2002). Statistical methods for psychology (Duxbury, Pacific Grove, USA).

Huberman, B.A. (2001). The laws of the web: Patterns in the ecology of information (MIT Press, Cambridge, Mass).

Ingwersen, P. (1998). The calculation of Web Impact Factors, Journal of Documentation, 54(2), 236-243.


Kling, R. & McKim, G. (2000). Not just a matter of time: Field differences in the shaping of electronic media in supporting scientific communication, Journal of the American Society for Information Science, 51(14), 1306-1320.

Kot, M., Silverman, E. & Berg, C.A. (2003). Zipf's law and the diversity of biology newsgroups, Scientometrics, 56(2), 247-257.

Krippendorff, K. (1980). Content Analysis: An Introduction to its Methodology, Sage, Beverly Hills, CA.

Landes, W.M. & Posner, R.A. (2000). Citations, age, fame, and the web, Journal of Legal Studies, 29(1), 319-344.

Lederbogen, U. & Trebbe, J. (2003). Promoting science on the web: Public relations for scientific organizations - results of a content analysis, Science Communication, 24(3), 333-352.

McCaughey, M. & Ayers, M. (eds) (2003). Cyberactivism: online activism in theory and practice, Routledge: New York.

McMillan, S. (2000). The microscope and the moving target: The challenge of applying content analysis to the world wide web, Journalism & Mass Communication Quarterly, 77(1), 80-98.

Miller, H. & Arnold, J. (2001). Breaking away from grounded identity? Women academics on the Web. CyberPsychology & Behavior, 4(1), 95-108.

Oppenheim, C. (2000). Do patent citations count? In: B. Cronin & H.B. Atkins (eds.), The web of knowledge: a festschrift in honor of Eugene Garfield. Information Today, Medford, NJ, pp. 405-432.

Papacharissi, Z. (2002). The self online: The utility of personal home pages, Journal of Broadcasting & Electronic Media, 46(3), 346-368.

Pennock, D., Flake, G., Lawrence, S., Glover, E. & Giles, C.L. (2002). Winners don't take all: Characterizing the competition for links on the web, Proceedings of the National Academy of Sciences, 99(8), 5207-5211.

Pruijt, H. (2002). Social capital and the equalizing potential of the Internet, Social Science Computer Review, 20(2), 109-115.

Shannon, C. & Weaver, W. (1963). Mathematical theory of communication. University of Illinois Press, Illinois.

Sloan, B. (2002). Personal Citation Index: Exploring the impact of selected papers. Available at: http://www.lis.uiuc.edu/~b-sloan/pci2.html, Accessed 18 June, 2002.

Sloan, B. (2001). Personal citation index, JESSE archives November 2001 (#74). Available at: http://listserv.utk.edu.

Stubbs, P. (1999). Virtual diaspora?: Imagining Croatia on-line, Sociological Research Online, 4(2), U102-U118.

Tang, R. & Thelwall, M. (2003). Disciplinary differences in US academic departmental web site interlinking, Library & Information Science Research, 25(4), 437-458.

Tashakkori, A. & Teddlie, C. (1998). Mixed methodology (Sage, London).

Thelwall, M. & Harries, G. (2003). The connection between the research of a university and counts of links to its web pages: An investigation based upon a classification of the relationships of pages to the research of the host university, Journal of the American Society for Information Science and Technology, 54(7), 594-602.

Thelwall, M. & Harries, G. (2004a). Do better scholars' web publications have significantly higher online impact? Journal of the American Society for Information Science and Technology, 55(2), 149-159.


Thelwall, M. & Harries, G. (2004b). Can personal web pages that link to universities yield information about the wider dissemination of research? Journal of Information Science, 30(3), 243-256.

Thelwall, M. & Tang, R. (2003). Disciplinary and linguistic considerations for academic web linking: An exploratory hyperlink mediated study with Mainland China and Taiwan, Scientometrics, 58(1), 153-179.

Thelwall, M. & Wilkinson, D. (2003). Three target document range metrics for university Web sites, Journal of the American Society for Information Science and Technology, 54(6), 489-496.

Thelwall, M. (2001a). Extracting macroscopic information from web links, Journal of the American Society for Information Science and Technology, 52(13), 1157-1168.

Thelwall, M. (2001b). A Web crawler design for data mining, Journal of Information Science, 27(5), 319-325.

Thelwall, M. (2002a). Conceptualizing documentation on the web: an evaluation of different heuristic-based models for counting links between university web sites, Journal of the American Society for Information Science and Technology, 53(12), 995-1005.

Thelwall, M. (2002b). Research dissemination and invocation on the Web, Online Information Review, 26(6), 413-420.

Thelwall, M. (2003). What is this link doing here? Beginning a fine-grained process of identifying reasons for academic hyperlink creation, Information Research, 8(3), paper no. 151. Available at: http://informationr.net/ir/8-3/paper151.html

Thelwall, M. (2004). Methods for reporting on the targets of links from national systems of university web sites, Information Processing & Management, 40(1), 125-144.

Treise, D., Walsh-Childers, K., Weigold, M.F. & Friedman, M. (2003). Cultivating the science Internet audience: Impact of brand and domain on source credibility for science information, Science Communication, 24(3), 309-332.

Vaughan, L. & Thelwall, M. (2003). Scholarly use of the Web: What are the key inducers of links to journal web sites? Journal of the American Society for Information Science and Technology, 54(1), 29-38.

Weare, C. & Lin, W.Y. (2000). Content analysis of the World Wide Web: Opportunities and challenges, Social Science Computer Review, 18(3), 272-292.

Weigold, M.F. (2001). Communicating science: A review of the literature, Science Communication, 23(2), 164-193.

Wilkinson, D., Harries, G., Thelwall, M. & Price, E. (2003). Motivations for academic web site interlinking: Evidence for the web as a novel source of information on informal scholarly communication, Journal of Information Science, 29(1), 59-66.

Wilkinson, D., Thelwall, M. & Li, X. (2003). Exploiting hyperlinks to study academic Web use, Social Science Computer Review, 21(3), 340-351.

Zinkhan, G.M., Conchar, M., Gupta, A. & Geissler, G. (1999). Motivations underlying the creation of personal web pages: An exploratory study, Advances in Consumer Research, 26, 69-74.


15

ACADEMIC NETWORKS

OBJECTIVES

• To explore how network diagrams can be used in academic webs.

INTRODUCTION

This chapter presents a small-scale exploratory case study of New Zealand university web sites. The aim is to illustrate how network diagram techniques can be applied to national academic webs, and hint at the types of information that network analysis techniques can reveal. There are two levels of the case study: university-wide and subject specific. In marked contrast to the academic publication style of chapter 14, this chapter is very exploratory in nature, investigating the techniques but not validating the findings or producing credible academic results.

METHODS

Raw data for the New Zealand university web site links is from SocSciBot crawls, database 14 on the cybermetrics web site (cybermetrics.wlv.ac.uk/database/). The crawl was conducted in December 2003. SocSciBot Tools was used to process the link structure files and to convert them into Pajek files. Pajek (Vlado, 2004) is a free network visualization program that can be used to create graphs of any kind of network. Various SocSciBot Tools options were used for the different diagrams produced, for example whether to include site self-links. The domain ADM was used for all diagrams. Manual searching of the New Zealand university web sites was used to identify departments for the subject specific networks.


Figure 15.1. Information flow.

The information flow for the data used is illustrated in Figure 15.1. Pajek was used to produce network visualizations from the files produced by SocSciBot Tools. The Kamada-Kawai algorithm in Pajek was used to create the visualizations.
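The sketch below is illustrative only (the domain pairs and file name are assumptions, and SocSciBot Tools' own file formats are not reproduced): it shows how domain ADM link pairs could be written out in Pajek's .net format, and how an equivalent Kamada-Kawai layout could be computed directly in Python with networkx.

```python
import networkx as nx

# Assumed input: directed domain ADM links extracted from a crawl.
domain_links = [("www.vuw.ac.nz", "sim.vuw.ac.nz"),
                ("sim.vuw.ac.nz", "blackboard.sim.vuw.ac.nz"),
                ("www.vuw.ac.nz", "scs.vuw.ac.nz")]

def write_pajek_net(links, path):
    """Write directed links as a Pajek .net file: a numbered vertex list, then an arc list."""
    vertices = sorted({d for pair in links for d in pair})
    index = {v: i + 1 for i, v in enumerate(vertices)}   # Pajek numbers vertices from 1
    with open(path, "w") as f:
        f.write(f"*Vertices {len(vertices)}\n")
        for v in vertices:
            f.write(f'{index[v]} "{v}"\n')
        f.write("*Arcs\n")
        for src, tgt in links:
            f.write(f"{index[src]} {index[tgt]}\n")

write_pajek_net(domain_links, "nz_domains.net")     # a file that Pajek can open directly

# The same layout algorithm used for the figures is also available outside Pajek:
graph = nx.DiGraph(domain_links)
positions = nx.kamada_kawai_layout(graph)           # node -> (x, y) coordinates
```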

UNIVERSITY SITE MAPS

Figures 15.2 and 15.3 are two site networks drawn at the domain ADM level. This means that each arrow represents at least one page in the source domain that targets at least one page in the target domain. Figure 15.2 is a small site and the diagram illustrates the centrality of the main site. Figure 15.3 is a much larger and more complex site, from which it is hardly possible to identify any structure.

Figure 15.2. Domain map of Lincoln University.


Figure 15.3. Domain map of Victoria University of Wellington.

For sites as complex as Figure 15.3, where there are so many labels that the structure and names are obscured, the underlying structure can be seen by plotting the graph without the labels (Figure 15.4). A combination of the two graphs is more informative than either on their own, even though only a minority of the labels can be read. For example, several focal points of links can be seen in Figure 15.4 without the node labels, but the labels in Figure 15.3 can show the themes associated with each focal point. Even from the obscured labels in Figure 15.3 it can be seen that one of the focal points is associated with domain names ending in .distance.scs.vuw.ac.nz. These are all associated with the Distance web site of the Student Computing Services.

A sensible alternative approach for mapping large sites is to trim the network into one or more smaller ones. This could be achieved by eliminating domains deemed to be irrelevant to the objectives of a given visualization, such as student sites, intranets, administrative pages, and online learning environments. Alternatively, all domains relating to one theme could be plotted, such as just the social science domains. Figure 15.5 illustrates domains connected to the School of Information Management at the Victoria University of Wellington, New Zealand. It has two domains, its main site sim.vuw.ac.nz and its online learning resource blackboard.sim.vuw.ac.nz.


Figure 15.4. Domain map of Victoria University of Wellington without labels.

Figure 15.5. Site selflinks connecting the Victoria University of Wellington School of Information Management.


Another way to explore a single university web site is to map its relationship with other universities. This can be achieved by excluding site self-links and plotting the university's site outlinks. This is shown in Figure 15.6 for a small university with few enough site outlinks to fit on a single diagram, and in Figure 15.7 for a larger university, with domain names removed to illustrate network structure. With large unlabelled networks like that in Figure 15.7, the person producing the visualizations could use the interactive capabilities of the graphing software (e.g. Pajek) to identify the names of the unlabelled nodes.

Figure 15.6. Site outlinks from Auckland University of Technology.


Figure 15.7. Site outlinks from Auckland University.

NATIONAL ACADEMIC WEB MAPS

Figures 15.8 and 15.9 are networks created from intersite links for all New Zealand universities. In Figure 15.8 domains are only shown if they are the source or target of at least one inter-university link, and all home domains are excluded. Even after these restrictions the graph is unmanageably large.

In order to obtain a smaller diagram, in Pajek it is possible to remove all nodes that have fewer than a given number of links attached. This option can be used to reduce a network to just its most central nodes. Figure 15.9 is the same data set as Figure 15.8, but after excluding all domains that have fewer than 8 links attached to them. The number 8 was decided upon through trial and error, being the smallest number that produced a clear diagram. There are few enough domains in this diagram for the labels to be added. From the domain names alone it can be seen that mathematics and computer science (cs) are particularly well connected in the New Zealand academic web.
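The same pruning step can be reproduced outside Pajek. The short sketch below (illustrative only; the example domain names are hypothetical) keeps only the nodes with at least a given number of attached links, in a single pass as described above.

```python
import networkx as nx

def trim_to_core(graph, min_links=8):
    """Keep only nodes with at least min_links attached links (inlinks plus outlinks)."""
    keep = [node for node in graph
            if graph.in_degree(node) + graph.out_degree(node) >= min_links]
    return graph.subgraph(keep).copy()

# Example: a tiny directed domain-level graph built from (source, target) link pairs.
g = nx.DiGraph([("math.auckland.ac.nz", "cs.waikato.ac.nz"),
                ("cs.waikato.ac.nz", "math.auckland.ac.nz")])
core = trim_to_core(g, min_links=2)
print(core.nodes())
```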


Figure 15.8. New Zealand domain academic web (>0 links per domain).

Figure 15.9. New Zealand domain academic web (>7 links per domain).


The problem of producing visualizations that are not unmanageably large is a critical one for web data. In the context of images that can comfortably fit on the pages of this book, the size restriction is quite severe. For other purposes, such as producing online pictures or interactive visualizations, much larger diagrams will be possible.

SUBJECT MAPS

Subject specific network maps may be able to reveal information about the relationship between one subject and others on the web. Figure 15.10 is a network for the site outlinks of New Zealand psychology department domains, and Figure 15.11 is a network of site inlinks. There is very little interdisciplinarity evident, with most of the links associated with home sites or other psychology departments. From Figure 15.11, no non-psychology domains link to psychology domains in different universities, which is very surprising.

See also Figure 15.5 for a site-specific and subject-specific map.

Figure 15.10. Site outlinks for New Zealand psychology departments.


Figure 15.11. Site inlinks for New Zealand psychology departments.

SUMMARY

The network diagrams have been able to reveal some information about the structures of university webs in New Zealand. Similar network diagrams could be produced for any country, but the diagrams would need to be larger if more universities were involved. For countries like the UK, with over 100 different university institutions, the scale of the diagrams is a real problem and it may be necessary to select a subject in order to give a manageable amount of information in the diagram, or to plot only the best connected domains.

Finally, the case studies presented here are rather dry because they were not attached to specific research questions. When combined with a given question, such as whether biology is more interdisciplinary than chemistry, the diagrams can serve to illustrate the issues involved. Key network connections can then also be individually investigated to relate the research question to the reasons why they were created.

FURTHER READING

The field of knowledge domain visualization (Borner, Chen & Boyack, 2003; Chen, 2003) has produced many academic network diagrams, although rarely for the web, and contains many useful ideas and much good advice concerning visualization production.


REFERENCES

Borner, K., Chen, C. & Boyack, K. (2003). Visualizing knowledge domains. Annual Review of Information Science & Technology, 37, 179-255.

Chen, C. (2003). Mapping scientific frontiers: The quest for knowledge visualization. New York: Springer Verlag.

Vlado, A. (2004). Networks / Pajek. http://vlado.fmf.uni-lj.si/pub/networks/pajek/


16

BUSINESS WEB SITES

OBJECTIVES

• To illustrate how link analysis of business web sites and related search engine based techniques can provide useful commercial information.

INTRODUCTION

The chapter presents some basic techniques for web site designers to ensure that their site is using the web effectively. Some additional methods to get information from competitors' web sites are also introduced to help businesses formulate their overall web strategies. The techniques build upon the search engine and web design themes developed in chapter 12, but extend them to comparing competing web sites. In a business context, some of this can be described as competitive intelligence. The chapter concludes with a simple case study that illustrates how a small amount of work can yield useful insights into others' web strategies.

SITE COVERAGE CHECKS

Many commercial web sites are not indexed in search engines (Lawrence & Giles, 1999; Thelwall, 2000). Web masters should therefore monitor their site's coverage in search engines. Some simple techniques can be used to assess this. The advanced search facilities often provided by search engines (>chapter 17) can be used to count how many of a site's pages have been found and indexed. Site coverage should be checked in all the major search engines as well as any other potential sources of additional visitors, such as relevant regional portals.

The results from different search engines can be compared to a SocSciBot crawl and site browsing to assess the proportion of a site's pages that are indexed by each search engine. Of course, web masters probably know how many pages are in their sites but a SocSciBot crawl can (a) also be applied to others' web sites and (b) give useful information about the crawlability of any site. More importantly, detective work needs to be done to find out which pages in the site have not been indexed and why. Likely problems include non-indexable links, non-HTML pages, complex URLs containing queries, and frameset pages (Thelwall, 2000). The detective work may be simple in some cases, but in others may involve tracking link structures in SocSciBot to find links that have not been followed in order to discover why individual pages are missing.

The results of an investigation can be used to fix site problems so that search engines can index it better. The same investigation can also be conducted on others' sites in order to identify pitfalls for a new site to avoid.
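A minimal sketch of the comparison step is given below (the URLs are hypothetical; in practice the crawled list would come from SocSciBot and the indexed list from advanced search engine queries). It simply diffs the two URL sets and flags unindexed pages that show obvious risk factors such as query strings or non-HTML formats.

```python
# Hypothetical example data for one site.
crawled = {"http://www.example.co.uk/index.html",
           "http://www.example.co.uk/products.asp?id=27",
           "http://www.example.co.uk/brochure.pdf"}
indexed = {"http://www.example.co.uk/index.html"}   # e.g. URLs returned by a site: query

for url in sorted(crawled - indexed):
    reasons = []
    if "?" in url:
        reasons.append("query URL")
    if not url.endswith((".html", ".htm", "/")):
        reasons.append("non-HTML or unusual extension")
    print(url, "->", ", ".join(reasons) or "reason unclear: check its inlinks in the crawl")
```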

SITE INDEXING AND RANKING CHECKS

Checking that a site's pages have been visited by search engine crawlers is only the first search engine checking stage. The second stage is to assess how likely it is that potential visitors will find the site through standard search engine searches. This means brainstorming to create a list of queries that a potential visitor might type into a search engine, and then checking the queries in search engines.

For a business, potential visitors might already know the company name and type this as a query. Alternatively, they might just search for a product or service that the company offers. Ideally, a company web site would be ranked number one for any search for any of its products or trading names in any major search engine. This is unrealistic, but it is still sensible to check ranking in the results pages of commercial search engines for likely queries. Some detective work can then be used to judge why sites have achieved their ranking position. This would involve at least an analysis of inlinks and text use in higher-ranked web sites (<chapter 12). Search engine ranking analysis can also be applied to competitors' web sites to gain additional insights into a business's market place, and to find effective techniques already in use.

A direct way to analyze the current performance of a web site and discover which search engines and links are providing new visitors is through web server log file analysis (Thelwall, 2001). This is outside the scope of this book, but brief introductory details are given here. Recall that web server log files, introduced in chapter 11, contain information recorded by web servers about the pages and other resources visited by users. One of the pieces of information recorded is the identity of the page previously visited by the user. Normally, the previous page will link to the logged page and so the server log file can be used to estimate how many times a page has been reached through following any given link. This information can be used to build statistics about which search engines are sending visitors to a site if their users click on links in the search results page. These search results page links can also reveal which queries the users typed into the search engine in the first place. A specialist web server log file analysis program can extract this kind of information, which is typically kept private to the company owning the server.
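As a rough illustration of the kind of processing such a program performs (a sketch only: it assumes the common Apache/NCSA combined log format, in which the referring page is the second-to-last quoted field, and it only extracts Google-style q= parameters; the log file name is hypothetical):

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

# In the combined log format each line ends: ... "referrer" "user-agent"
referrer_re = re.compile(r'"([^"]*)" "[^"]*"\s*$')

engines, queries = Counter(), Counter()
with open("access.log") as log:
    for line in log:
        match = referrer_re.search(line)
        if not match:
            continue
        referrer = urlparse(match.group(1))
        if "google." in referrer.netloc:        # visitor arrived from a Google results page
            engines[referrer.netloc] += 1
            q = parse_qs(referrer.query).get("q")
            if q:
                queries[q[0]] += 1              # the query the visitor typed

print(engines.most_common(5))
print(queries.most_common(10))
```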

COMPETITIVE INTELLIGENCE

Competitive intelligence (CI) is "a systematic and ethical program for gathering, analyzing, and managing external information that can affect your company's plans, decisions, and operations" (SCIP, 2004). The web is a natural source of information about other businesses because of its open access nature and the ease with which the information can be found (Nordstrom & Pinkerton, 1999; Vibert, 2004). In this context, an important new CI challenge for companies, in addition to traditional CI needs, is to be aware of how their competitors attempt to gain customers on the web. Much information about online marketing strategies may be found by browsing business web sites. Additional intelligence can be revealed about a company by finding and investigating external web sites that link to them or mention them. Discovering this kind of information is really an extension of the standard business practice of scanning the media for information about competitors (Underwood, 2001). The following two activities can initiate an investigation (cf. Vella & McGonagle, 2004).

• Using advanced commercial search engine queries to find out which web sites link to each competitor's web site, visiting the linking site to discover why it created the link.

• Using basic commercial search engine queries to find out which web sites mention competitors' names, and visiting each site to find out why the competitor is mentioned. Advanced search engine queries may be needed to narrow down search results if the competitor has a common name.

CASE STUDY

This case study is an investigation into the web presences of five similar companies from the perspective of a competitor wishing to move into the same market place. The companies are family-oriented holiday businesses operating in the UK, namely Butlins™, Pontins™, Haven Holidays™, Center Parcs™ and Hoseasons™. The first four offer mainly caravan and chalet holidays, and the last mainly waterways-based holidays. All presumably seek to attract holidaymakers online in addition to through traditional mechanisms.

Only results from the search engine Google are reported below in order to keep the amount of information manageable, but for a real application the same steps should be repeated for all of the major search engines.

The term query URL is repeatedly used below, and it refers to a URL containing a question mark. The purpose of a question mark in a URL is normally to separate the information on its left hand side, concerning the location of the page, from the information on its right hand side, which is information to send to a program located on the web server. The following example illustrates the query URL generated by a Google search for pages containing the word "butlins".

http://www.google.com/search?hl=en&lr=&ie=UTF-8&q=butlins&btnG=Search

The left hand part is the name of the program, "search", on the Google web site. On the right hand side of the question mark is data sent to the program: hl=en&lr=&ie=UTF-8&q=butlins&btnG=Search, which in this case contains a variety of information split by the ampersands into variable name - variable value pairs as shown in Table 16.1.


Table 16.1. The decoded Google search URL.

String         Variable   Value      Interpretation
hl=en          hl         en         Report the results in English
lr=            lr         (empty)    Language restriction
ie=UTF-8       ie         UTF-8      Character encoding scheme to use
q=butlins      q          butlins    The query sent to Google
btnG=Search    btnG       Search     The name of the button just clicked
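The same decomposition can be done programmatically. A small sketch with Python's standard urllib.parse module (using the Google example URL above) splits the query URL into its name-value pairs:

```python
from urllib.parse import urlparse, parse_qsl

url = "http://www.google.com/search?hl=en&lr=&ie=UTF-8&q=butlins&btnG=Search"
parts = urlparse(url)

print(parts.netloc, parts.path)   # www.google.com /search -- the program being called
for name, value in parse_qsl(parts.query, keep_blank_values=True):
    print(f"{name} = {value!r}")  # hl='en', lr='', ie='UTF-8', q='butlins', btnG='Search'
```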

The software Macromedia Flash is referred to below. This is a technology that allows multimedia presentations to be published on the web. Flash files typically have the file name extension .swf in their URLs. The following three types of software running on web servers to create web pages automatically are also alluded to below. The fourth one is not mentioned but is also common and is included for completeness.

• JavaServer Pages. Often identified by the web page file name extension .jsp
• Active Server Pages. Often identified by the web page file name extension .asp
• Cold Fusion Markup Language. Often identified by the web page file name extension .cfm
• PHP. Often identified by the web page file name extensions .php or .php4

Center Parcs

Results of SocSciBot site crawl A crawl from the home page gave 402 pages, plus one redirection instruction (from the home page). Four of the 402 pages were PDF files, the rest were jsp/HTML files. There were no query URLs in the site.

Search engine coverage About 483 pages matched the Google query site:centerparcs.co.uk. All matches are JavaServer Pages without queries in their URLs. Google's coverage of the site is better than SocSciBot's, presumably through remembering old pages. There were no problems with Google's coverage of the site.

Links to site About 93 links matched the Google query link:www.centerparcs.co.uk, but the majority were site self-links.

Search engine searches Ranked 1 in Google for Center Parcs. http://www.centerparcs.co.uk/home.jsp

Mentions of site About 77,900 pages matched the Google query "Center Parcs" -site:centerparcs.co.uk. The list included many Center Parcs sites for different individual resorts. There were also many multiple-company holiday-oriented sites such as travelwithkids.about.com, the Naturist UK Fact File, and newspaper holiday review stories.

Hoseasons

Results of SocSciBot site crawl A crawl from the home page gave 3,891 pages.


Search engine coverage About 3,890 pages matched the Google query site:hoseasons.co.uk, and most contain query URLs. Some non-query URLs gave missing page errors, redirecting to a new Active Server Pages page. There were many Active Server Pages without queries. There were also some strange secure HTML query URLs, such as: https://www.hoseasons.co.uk/holidays/rilodges/parks_agents/ATLV.html?420.2,0%20. There were no problems with Google's coverage of the site.

Links to site About 65 links matched the Google query link:www.hoseasons.co.uk, including several important search engine directories such as the following.
• uk.dir.yahoo.com/Business_and_Economy/Shopping_and_Services/Travel_and_Transportation/Accommodation/
• directory.google.com/Top/Regional/Europe/United_Kingdom/Travel_and_Tourism/Travel_Services/Tour_Operators/Domestic/

Search engine searches Ranked 1 in Google for hoseasons. http://www.hoseasons.co.uk/

Mentions of site About 6,260 pages matched the Google query Hoseasons -site:hoseasons.co.uk. The list mainly consisted of multiple-company holiday-oriented sites such as www.reviewcentre.com: holidays tour operators reviews, but also www.ukbusinesspark.co.uk: an online competitive intelligence site, and travelmole.com: "The Online Community for the Travel and Tourism Industry".

Butlins

Results of SocSciBot site crawl A crawl from the home page gave only 2 pages when excluding query URLs, and 778 pages when including them (there were over 2,000 URLs but many gave duplicate pages). Some of the pages were very large. For example the page http://www.butlinsonline.co.uk/index.cfm?channel=2370&fromDate=4/Jun/2004&endDate=2/Jul/index.cfm?channel=2074 was 1.21 Mb, being a table of all available family holidays.

Search engine coverage About 2,120 pages matched the Google query site:butlinsonline.co.uk, although almost all contain query URLs. There were some non-HTML Flash pages. There were no problems with Google's coverage of the site compared to SocSciBot.

Links to site About 128 links matched the Google query link:www.butlinsonline.co.uk, but the majority were site self-links. Most other links were pages about accommodation from general holiday sites.

Search engine searches Ranked 1 in Google for butlins. http://www.butlinsonline.co.uk/

Mentions of site About 45,000 pages matched the Google query butlins -site:butlinsonline.co.uk. The list mainly consisted of multiple-company holiday-oriented sites such as www.discover-holidays.co.uk, but also very many of what appear to be Butlins franchise sites such as "Butlins Skegness, Caravan Holidays with Liesls Holidays", and Butlins-related activity or fan club sites, such as http://www.swimathon.org: "Butlins Swimathon". Butlins make extensive use of the web across multiple domains.


Pontins

Results of SocSciBot site crawl A crawl from the home page excluding query URLs gave 1 page. Although all the pages end in .html, this is followed by a query and a session ID to track the individual user. Crawling inclusive of queries yielded 113 pages.

Search engine coverage About 55 pages matched the Google query site:pontins.com, although some contain multimedia in the form of Macromedia Flash presentations. The site uses a lot of Flash, including the home page. The site home page also contains plain text links. The lower number of pages found compared to SocSciBot is a source of concern.

Links to site About 35 links matched the Google query link:www.pontins.com, but the majority were site self-links. Most other links were pages about accommodation from general holiday sites. It was linked to from one major directory:
• dir.yahoo.com/Regional/Countries/United_Kingdom/Business_and_Economy/Shopping_and_Services/Travel_and_Transportation/Lodging/Resorts/

Search engine searches Ranked 1 in Google for Pontins. http://www.pontins.com/; also http://www.pontins.co.uk/ redirects to http://www.pontins.com/

Mentions of site About 15,000 pages matched the Google query pontins -site:pontins.com. The list contained many Pontins-related activity or fan club sites, such as kevsfx.com: a Pontins holidaymaker's pictures, and www.pontinsdolphin.4t.com: a Pontins Dolphin Holiday Village memories site.

Haven Holidays

Results of SocSciBot site crawl A crawl from the home page gave 0 pages. A crawl including query URLs gave 8,670 pages. Some of these were very long queries, a possible reason for Google's poor indexing (see below).

Search engine coverage 4 pages matched the Google query site:haven-holidays.co.uk. For an unknown reason, Google did not cover any of the query URLs in this site.

1. www.haven-holidays.co.uk/touring
2. online.haven-holidays.co.uk/sitemap.asp
3. online.haven-holidays.co.uk/holiday_bargains.html
4. haven-holidays.co.uk/

A further 4 pages matched the Google query site:havenholidays.com.
1. www.havenholidays.com/holiday_bargains.html
2. www.havenholidays.com/self_catering.html
3. www.havenholidays.com/cssc
4. www.havenholidays.com/short_breaks.html

Coverage in Google seems to be very poor, a serious concern especially because there is no clear reason for the poor coverage.

Links to site About 12 links matched the Google query link:www.haven-holidays.co.uk, but there were several major directories, such as:
• uk.dir.yahoo.com/Business_and_Economy/Shopping_and_Services/Travel_and_Transportation/Accommodation/Resorts/


Two alternative searches were tried: link:online.haven-holidays.co.uk returned 1 link and link:online.haven-holidays.co.uk/page.asp?ref=404&language=1 returned 0 links.

Search engine searches Not in the first 50 of Google for haven. Matches 2 and 3 for "Haven holidays", but not for the official site, only for third party sites describing Haven. Following links gets to a site using query-based URLs, e.g. http://online.haven-holidays.co.uk/page.asp?ref=404&language=1&EQID=5101-9999999. Direct typing in of the URL www.haven-holidays.co.uk gives the following complex URL: http://online.haven-holidays.co.uk/page.asp?ref=404&language=1. Following links to other pages reveals many links, including some to pages with the domain name www.havenholidays.com. Direct typing of this home page redirects to the same main page as above.

Mentions of site About 17,200 pages matched the Google query "haven holidays" -site:haven-holidays.co.uk. The list mainly consisted of multiple-company holiday-oriented sites.

General queries

The following general searches were tried in Google and the results examined. The importance of multiple-site holiday companies is confirmed, as is the significantly different performances of the holiday company web sites.

• "UK h o l i d a y " Butlins was the 5th site for this search; the other companies werenot mentioned. The other sites in the top 10 contained information about multipleholiday companies.

• "UK Family h o l i d a y " Butlins was the top site for this search; Hoseasons was9th

• "UK Family b reak" All top 10 results were sites giving information aboutmultiple holiday companies.

SUMMARY

The link analysis was of some use for understanding site coverage issues. One of the sites had very bad Google coverage for reasons that were not clear but might be connected to its use of long query URLs.

The text analysis was potentially very useful. A company could well consider it beneficial to pay an employee to run through all the major sites mentioning each competitor and then attempt to get mentioned on the same site. This would yield useful publicity. Also, the types of site represented might give ideas about potential new customer bases that may not have previously been considered, such as the naturists found mentioning Center Parcs. Moreover, the many fan sites for Butlins and Pontins may give insights into the aspects of their holidays that customers had found valuable enough to record and publicize.

The three general search queries would have produced customers for the web site of Butlins, some for Hoseasons and none for the other companies, although all may have got extra visitors indirectly from the top ranked multiple holiday company web sites. This emphasizes the importance of these multiple holiday sites, but also shows that Butlins seems to be very successful in its web positioning.

Although the amount of information about each company is variable and the text search results are dependent upon the uniqueness of their names, the overall results still represent a considerable body of relevant knowledge obtained for relatively little effort, given a basic level of web use and searching skill.

FURTHER READING

For more wide-ranging web-based competitive intelligence strategies, see Vibert (2004), particularly chapter 5 (Vella & McGonagle, 2004), as well as Nordstrom and Pinkerton (1999).

Wormell (2001) has suggested that hyperlinks can be used on a larger scale for competitive intelligence by using advanced search engine queries to identify relationships and trends on the web. Her suggestions fit well with the techniques applied in the majority of this book, but in this chapter the emphasis is on much smaller-scale localized investigations. For a larger-scale quantitative web study, see Vaughan's (2004) analysis of the relationship between links to company web sites and the companies' revenue and profits.

REFERENCES

Lawrence, S. & Giles, C.L. (1999). Accessibility of information on the web. Nature, 400, 107-109.

Nordstrom, R.D. & Pinkerton, R.L. (1999). Taking advantage of Internet sources to build a competitive intelligence system. Competitive Intelligence Review, 10(1), 54-61.

SCIP - Society of Competitive Intelligence Professionals (2004). CI resources: What is CI? Available: http://www.scip.org/ci/

Thelwall, M. (2000). Commercial Web sites: Lost in cyberspace? Internet Research, 10(2), 150-159.

Thelwall, M. (2001). Web log file analysis: Backlinks and queries. ASLIB Proceedings, 53(6), 217-223.

Underwood, J. (2001). Competitive intelligence. New York: Capstone Express Exec, Wiley.

Vaughan, L. (2004). Web hyperlinks reflect business performance: A study of US and Chinese IT companies. Canadian Journal of Information and Library Science, 28(1), 17-32.

Vella, C.M. & McGonagle, J.J. (2004). In: Vibert, C., Competitive intelligence: A framework for web-based analysis and decision making. Mason, OH: South-Western, pp. 69-82.

Vibert, C. (2004). Competitive intelligence: A framework for web-based analysis and decision making. Mason, OH: South-Western.

Wormell, I. (2001). Informetrics and Webometrics for measuring impact, visibility, and connectivity in science, politics and business. Competitive Intelligence Review, 12(1), 12-23.


V TOOLS AND TECHNIQUES

Part V of this book is partly online and partly offline. In the case of some of the chapters, a significant proportion is online, such as software for implementing the techniques described. In other chapters the online information serves the purpose of giving up-to-date information about relevant tools available, such as network visualization software and commercial search engine advanced searches. Some of the chapters do not have significant information online, however, such as the chapter on embedded link analysis methodologies. Some general information about the relevant online component of the book, if any, follows each chapter's summary.

17

USING COMMERCIAL SEARCH ENGINES AND THE INTERNET ARCHIVE

OBJECTIVES

• To review good practice for using commercial search engines or the Internet Archive.

INTRODUCTION

This chapter reviews the practical problems that may be faced when using commercial search engines for link data or the Internet Archive for historical information concerning pages (in support of link analysis). Web crawlers, search engines and the Internet Archive have already been discussed in chapter 2, but the purpose of this chapter is to deal with the practical problems facing researchers when collecting data. The emphasis will be on understanding the results returned by search engines in order to avoid making mistakes when interpreting their meaning.


The online component of this chapter gives tips and advice for using specific search engines. Before their revision in April 2004, the two most sophisticated advanced search engine interfaces were those of AltaVista and AllTheWeb, but subsequently there has been less choice for link analysis. Nevertheless, most search engines offer some link identification or counting facilities.

CHECKING RESULTS

Search engines do not cover the whole web and their results can be unreliable and unstable (<chapter 2). A less visible problem is that of knowing how a query sent to a search engine will be interpreted, and how it will compile its results. For example, at the time of writing the Google query link:www.wlv.ac.uk matches pages in the Google index that contain a link to either of the two exact URLs http://www.wlv.ac.uk or http://www.wlv.ac.uk/, but longer URLs such as http://www.wlv.ac.uk/lib/ do not match. The equivalent command in AltaVista used to match pages in the AltaVista index that contained a link to any URL containing www.wlv.ac.uk, and so it would match all three of the URLs above. Careful reading of the help pages of both search engines could discover this difference. Other differences have not been documented, which leads to the following important conclusion.

Results from search engines should never be taken at face value, and should always be investigated.

A different occurrence leading to the same conclusion is that researchers can make mistakes when interpreting results and formulating queries. For example, Smith (1999) pointed out that AltaVista queries that were ostensibly for pages linking to whole countries, such as link:pg for pages that link to Papua New Guinea and link:id for pages that link to Indonesia, would get "wildly distorted" results. This was caused by the queries matching anywhere in the URL, including the file name.

There is also a problem with making assumptions about the content or ownership of a page from its domain name. This is actually irrelevant to search engine accuracy, but can be tested for at the same time as the search engine results are checked. Some domains, such as .nu, typically host pages that are completely unrelated to the owning country. Even apparently definitive top-level domains, such as .es for Spain, may produce some surprises, such as foreign companies buying a .es domain name to help open a market place in Spain.

Help in understanding search engine results is available from online sources that specialize in tracking them, but these should not be relied upon since search engine algorithms periodically change. The following steps are advocated to test search engine results.

1 Visit a random sample of pages matching your queries to ensure that the pages do actually match. Expect a small percentage of the links to have disappeared. In large pages, you may need to "view source" to see the HTML code of the page visited and then find the link URL.


Check the following to see which apply.
• Is the exact link URL in the page?
• Is a variant of the link URL in the page? For example, does it have a different equivalent domain name or URL file name ending?
• Is the link URL visible in the page or is it only in the HTML of the page, with different text displayed in the browser?
• Is the URL displayed on the screen, but not as a clickable link?
Record the answers to the above questions for a sample of link results for different queries. A pattern should then emerge about the types of pages that match the queries. Look out also for other patterns in the results. For example, if URL variants are found in the results, try to decipher the rules that have allowed these variants.

2 View the results pages and assess the exhaustiveness of the results. Check the following.
• Are more than one or two results returned for a single link source site?
• Are any results returned from the same site as the URL, i.e. site self-links?
• Are there any pages missing from the results that you know should be there? If so, check if the search engine indexes the pages and, if it does, try to find out why the pages are missing.
Again, record the answers to the above questions for several queries and look for consistency and patterns. View all of the results pages for queries, when possible looking for patterns also in the positioning of results. For instance, are well-known or large sites predominantly on the first results pages?

3 View the "matching pages" number counts reported on each results page (if any). Arethese the same on all results pages for a given query? If not, for example the numberon the first results page is always different from the numbers on all subsequent pages,try to deduce which are the most reliable. For example, the number on the first pagemay be an approximate guess, based upon processing only the most important results,whereas subsequent pages may be more accurate, based upon processing most or all ofthe search engine's data set.

DEALING WITH VARIATIONS IN RESULTS

In the early years of search engines, the numbers of 'hits' for any search could vary enormously from one second to the next. This led Rousseau (1999) to propose that those using web count statistics from search engines in research should conduct several identical searches over a period of days, summarizing the reported counts with their median. At the time of writing, search engine results had become more stable and this approach did not seem necessary any more. Nevertheless, it is still important to check how much results change over time. If large variations are observed then Rousseau's technique should be applied.
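
As a minimal sketch of Rousseau's technique (the figures below are invented for illustration), the repeated counts can be summarized by their median, which is robust to the occasional wildly different value:

import statistics

# Hypothetical hit counts for the same query repeated on seven different days.
daily_counts = [15200, 14800, 15600, 9800, 15100, 15300, 14900]

# The median is far less affected by the one anomalous day (9800) than the mean.
print(statistics.median(daily_counts))   # 15100
print(statistics.mean(daily_counts))     # about 14386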


USING MULTIPLE SEARCH ENGINES

One possible way to limit the impact of partial search engine coverage on web results is to use more than one search engine. Since search engine overlaps in web coverage seem to be surprisingly small (Lawrence & Giles, 1999; Bar-Ilan, 2001), a logical strategy is to combine the results of different engines, an approach promoted by Bar-Ilan (2000, 2001). This can be achieved by retrieving from each search engine a full list of URLs matching a given query and then combining the lists after eliminating duplicates. This assumes that there are few enough results that the search engines will report the URLs of all of their matching pages. This approach gives better coverage than the use of a single search engine, but multiplies the work needed for data collection since the results of each search engine need to be checked, as described above, and then the checked results combined.

Bar-Ilan (2001) has compared different search engines to assess their overlaps in coverage of the web, and has made an important distinction between non-indexed URLs and non-indexed content. This is relevant because a search engine may not index an URL because it has found the same content in another page with a different URL. When combining the results of more than one search engine, it is therefore desirable to perform a content check on the URLs to ensure that duplicate URLs for the same page content are discarded.
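
A minimal sketch of the combination step is given below, assuming that a full list of matching URLs has already been retrieved from each engine. The function names are illustrative rather than taken from any particular tool; a light URL normalization removes trivial variants and a content hash is used to discard different URLs that serve the same page content:

import hashlib
import urllib.request

def normalize(url):
    # A very light normalization: ignore surrounding whitespace and trailing slashes.
    # A fuller version would also lowercase the domain name part only.
    return url.strip().rstrip("/")

def content_fingerprint(url):
    # Hash the page body so that duplicate content at different URLs can be spotted.
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return hashlib.md5(response.read()).hexdigest()
    except OSError:
        return None                       # unreachable pages are kept but not fingerprinted

def combine(result_lists):
    seen_urls, seen_content, combined = set(), set(), []
    for results in result_lists:          # one list of URLs per search engine
        for url in results:
            u = normalize(url)
            if u in seen_urls:
                continue                  # duplicate URL already recorded
            seen_urls.add(u)
            fingerprint = content_fingerprint(u)
            if fingerprint is not None and fingerprint in seen_content:
                continue                  # same content already found at another URL
            if fingerprint is not None:
                seen_content.add(fingerprint)
            combined.append(u)
    return combined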

The combined results of several search engines are an improvement on the results of any individual engine, but still suffer from the generic biases of search engines, caused by their reliance upon crawlers (<chapter 2).

USING THE INTERNET ARCHIVE

Recall that the Internet Archive (<chapter 2) operates like a commercial search engine, except that it is a not-for-profit organization with the aim of providing a historical record of the web. Old pages are not discarded when they are replaced on the web; they are kept alongside newer versions. At the time of writing, the Internet Archive did not offer a link search facility. If it offers this service then the same testing should be applied to it as to commercial search engines, as described above. The Archive offers programmers access to its raw data, so researchers may wish to write their own programs to extract link statistics.

The Archive can be used as part of link analysis exercises even when not directly used to find links. For example, if a link database has been used and the source page of a link needs to be visited, the Archive's WayBack Machine could be used to find a historical copy from approximately the correct date. The WayBack Machine interface allows searches for all copies of a given page that are stored in the Archive, reporting the crawl date for each one. It appears to be accurate and reliable and so the only data issue is the fact that the coverage of the Archive, like anything else compiled by a crawler, is limited to the sites that the crawler was able to find. Unfortunately, this means that older sites and better-linked sites are more likely to be in the archive. On an international scale, countries that adopted the web early are likely to be better represented in the archive than more recent adopters (Thelwall & Vaughan, 2004). This problem cannot be easily avoided when using the Archive, but should be acknowledged when reporting results.
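
Stored copies can also be addressed directly by date. The sketch below (with an illustrative page and date) builds a WayBack Machine URL of the form web.archive.org/web/YYYYMMDDhhmmss/<original URL>; requesting a timestamp that is not stored redirects to the closest available copy:

def wayback_url(page_url, year, month, day):
    # Ask the WayBack Machine for the copy of a page closest to a target date.
    timestamp = "%04d%02d%02d000000" % (year, month, day)
    return "http://web.archive.org/web/%s/%s" % (timestamp, page_url)

# Illustrative use: a copy of a link source page from around June 2004.
print(wayback_url("http://www.wlv.ac.uk/", 2004, 6, 1))
# http://web.archive.org/web/20040601000000/http://www.wlv.ac.uk/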


SUMMARY

Commercial search engines offer link search facilities that are valuable for link analysis. Extensive testing of search engine results is needed to make sure that searches are returning pages that do match the query in the intended way, however, and also to identify any types of pages that are (perhaps unintentionally) excluded from the results. Identical queries may operate in significantly different ways between search engines, so testing should be extended to each search engine used.

ONLINE RESOURCES

The online component of this chapter contains information about the link analysis support given by different search engines as well as instructions for link-related searches: http://linkanalysis.wlv.ac.uk/17.htm. Table 17.1 illustrates the kind of information that is available. More information is online, and any changed URLs will be updated there.

Table 17.1. Information available in the online component of this book.

Information/Use: Links to particularly useful search engines.
Examples: Google: http://www.google.com/; HotBot: http://www.hotbot.com/; Yahoo!: http://www.yahoo.com/; The Internet Archive: http://www.archive.org/

Information/Use: Links to sites giving information about search engines. To retrieve up-to-date information about the algorithms and coverage of the major commercial search engines.
Examples: Search Engine Watch.com (http://www.searchenginewatch.com/); Greg Notess' Search Engine News (http://notess.com/search/)

Information/Use: Instructions for how to conduct link analysis queries in commercial search engines, where possible. To be able to use commercial search engines to obtain various different link count and related statistics for use in link analysis investigations.
Examples: In Google the search link:URL will match any page that links to the precise URL given. For example, link:cybermetrics.wlv.ac.uk will match all pages that link to the Cybermetrics research group home page http://cybermetrics.wlv.ac.uk but not to any of the other pages in that site. Similar information about other search engines and searches.

Figure 17.1 shows the top of the online component of this chapter, as of July, 2004.


Figure 17.1. The online component of chapter 17 (http://linkanalysis.wlv.ac.uk/17.htm).

FURTHER READING

Horror stories about search engine results prove salutary reading (Rousseau, 1999; Bar-Ilan, 1999; Thelwall, 2001; Smith, 1999; Snyder & Rosenbaum, 1999). Some insights into the nature of different types of variability are given by Mettrop & Nieuwenhuysen (2001).

REFERENCES

Bar-Ilan, J. (1999). Search engine results over time: A case study on search engine stability. Cybermetrics, 2/3(1), paper 1. Available: http://www.cindoc.csic.es/cybermetrics/articles/v2i1p1.html

Bar-Ilan, J. (2000). The web as an information source on informetrics? A content analysis. Journal of the American Society for Information Science, 51(5), 432-443.

Bar-Ilan, J. (2001). Data collection methods on the web for informetric purposes: A review and analysis. Scientometrics, 50(1), 7-32.

Lawrence, S. & Giles, C.L. (1999). Accessibility and distribution of information on the web. Nature, 400, 107-110.

Mettrop, W. & Nieuwenhuysen, P. (2001). Internet search engines - fluctuations in document accessibility. Journal of Documentation, 57(5), 623-651.


Rousseau, R. (1999). Daily time series of common single word searches in AltaVista and NorthernLight. Cybermetrics, 2/3(1), paper 2. Available: http://www.cindoc.csic.es/cybermetrics/articles/v2i1p2.html

Smith, A.G. (1999). A tale of two Web spaces: Comparing sites using web impact factors. Journal of Documentation, 55(5), 577-592.

Snyder, H. & Rosenbaum, H. (1999). Can search engines be used as tools for web-link analysis? A critical view. Journal of Documentation, 55(4), 375-384.

Thelwall, M., & Vaughan, L. (2004). A fair history of the web? Examining country balance in the Internet Archive. Library & Information Science Research, 26(2), 162-176.

Thelwall, M. (2001). The responsiveness of search engine indexes. Cybermetrics, 5(1). Available: http://www.cindoc.csic.es/cybermetrics/articles/v5i1p1.html


18

PERSONAL CRAWLERS

OBJECTIVES

• To review the personal crawler types available.
• To provide an overview of SocSciBot and SocSciBot Tools.

INTRODUCTION

This chapter covers a similar topic to chapter 2, web crawlers and search engines, but from the perspective of someone wishing to use a web crawler to gather link data. The online part of this book gives information about currently available crawlers, and this chapter includes an overview of the types available and practical issues for users. SocSciBot and SocSciBot Tools (http://socscibot.wlv.ac.uk), the software associated with this book, are given special coverage because they can perform all the analyses described in this book.

TYPES OF PERSONAL CRAWLER

There are many different web crawlers that can be downloaded over the Internet either free or for a small charge. Other crawlers are supplied as part of larger software suites. A common purpose for crawlers is for use by web site designers or managers to check sites for broken links. This may include checking links to external sites, and producing link-based visualizations of site structures.

Site management programs Site management programs typically produce web site reports combining link summary statistics with other summary statistics concerning the pages crawled (e.g., average size, file extension, file format). This type of program is known by many different names, including: site manager, site profiler, site analyzer, and link checker. The crawlers of site management programs gather the raw data for the reports. The crawlers will each be unique in the hidden details of their crawling strategies and so different crawlers can be expected to report different statistics for the same site (Arroyo, 2004). There is no way to find out directly the full crawl strategy of most site management crawlers, but detailed examination of their results and testing on sample sites is recommended to identify the broad parameters under which each one operates. For instance, it would be useful to know whether a crawler automatically rejects all URLs containing a question mark, because they are likely to be automatically generated pages and may lead to spider traps. The main practical problem with most site management crawlers is that their link statistics are not designed for link analysis purposes and so are not presented in a useful form for this. The first test to be made of any site management crawler is therefore whether the necessary statistics or link lists can be extracted from its results.

Research crawlers Research crawlers, including SocSciBot, crawl web sites with the primary purpose of producing link and other statistics for research rather than for site management purposes.

Site downloaders A site downloader is a crawler that is designed to copy sites from the web to a local machine, normally so that the user can browse the site offline, i.e. viewing the local copies of the pages instead of the originals on the web. Common names for this type of software include web site downloaders/copiers/grabbers/scanners, download managers and offline browsers/explorers. Some site downloaders report link statistics as a by-product of their main activity. Researchers can use site downloaders to copy a site, subsequently using their own program to extract link data from the downloaded site. An example of a site downloader is Teleport Pro (www.tenmax.com/teleport/pro/).

Customizable crawlers The computer science community has produced open source web crawlers, often as part of larger suites of programs. 'Open source' means that in addition to the working program, the program code is also provided free of charge. This allows link researchers to program the crawler for specific link analysis tasks, turning the customizable crawler into a research crawler. An example is Harvest-NG (webharvest.sourceforge.net/ng/), as used by Cothey (2005) for link analysis.

SOCSCIBOT

SocSciBot, like any web crawler, works by recursively requesting web pages (<chapter 2). It extracts URLs from the HTML of web pages and repeats the process with each new URL found. In this section some more information about web crawling is given, much of it specific to SocSciBot. Figure 18.1 is a screen-shot of SocSciBot in the middle of a crawl. From the title bar at the top, "15/28:linkanalysis.wlv.ac.uk/13.htm", it is fetching the URL http://linkanalysis.wlv.ac.uk/13.htm, and this is the 15th URL in its list of 28 URLs to fetch.

Web page retrieval

Web pages are retrieved over the Internet via a mechanism known as the Hypertext Transfer Protocol (HTTP). Although modern web browsers can cope with other protocols, such as the File Transfer Protocol (FTP), SocSciBot only uses HTTP. An HTTP session is initiated by SocSciBot requesting an URL. For example, to request the URL http://www.google.com/options/index.html an HTTP message containing the information below would be sent to www.google.com. The first line is a request to GET the file specified, /options/index.html, with the third line specifying the host name of the web server containing it (in case the same web server looks after different hosts). The middle line identifies that SocSciBot is sending the request.


GET /options/index.html HTTP/1.0
User-Agent: SocSciBot
Host: www.google.com

Google's web server replies to this HTTP request by sending an HTTP message such as the following, then sending the web page requested. The first line gives the code 200 OK to indicate that it is able to find the file and is prepared to send it. The second line identifies the file as a plain text HTML file, i.e. a normal web page.

HTTP/1.0 200 OK
Content-Type: text/html
Last-Modified: Mon, 05 Apr 2004 01:30:32 GMT

If the content type of a requested file is not text/html but something else such as a Microsoft Word document or a picture, SocSciBot will not download the file. This use of the HTTP protocol avoids unnecessary downloading of non-HTML pages that cannot be processed by SocSciBot.
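
One way to imitate this behaviour in a short script (a sketch only, not SocSciBot's own code) is to request the headers first and only download the body when the reported content type is text/html:

import urllib.request

def fetch_if_html(url, timeout=60):
    # Ask for the HTTP headers first.
    head_request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head_request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
    if not content_type.startswith("text/html"):
        return None          # e.g. a Word document or a picture: do not download it
    # Only now download the page body itself.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")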

Web page qualification

Domain name matching is the default technique used by SocSciBot to decide whether to crawl an URL. Recall that the domain name part of an URL is normally the portion of the URL between the http:// and the first subsequent slash. For example, the domain name for http://www.google.com/options/index.html is www.google.com. URLs normally qualify for a given SocSciBot site crawl if their domain names end in a known domain name of the site being crawled. For example, the known domain names for the University of Wolverhampton are wlv.ac.uk and wolverhampton.ac.uk. Thus, http://www.scit.wlv.ac.uk/index.html and http://www.wolverhampton.ac.uk/lib/ both qualify for this site, but www.wlv.gov.uk does not.


SocSciBot also allows crawls to be specified by path. If a domain name plus a file path is specified as the scope of the crawl, then all URLs containing the path will qualify for crawling. For example, if the scope was set to .google.com/options/ then the page http://www.google.com/options/index.html would qualify but http://www.google.com/ would not.
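
A sketch of this qualification rule is shown below, assuming the crawl scope is supplied either as a set of known domain names or as a domain-plus-path string (the function and parameter names are illustrative, not taken from SocSciBot itself):

from urllib.parse import urlsplit

def qualifies(url, known_domains=(), path_scope=None):
    # Domain name matching: the URL's domain must end in a known domain name.
    domain = urlsplit(url).netloc.lower()
    domain_ok = any(domain == d or domain.endswith("." + d) for d in known_domains)
    # Path matching: alternatively, the URL must contain the given path scope.
    path_ok = path_scope is not None and path_scope in url
    return domain_ok or path_ok

# Illustrative checks following the examples in the text.
wlv = ("wlv.ac.uk", "wolverhampton.ac.uk")
print(qualifies("http://www.scit.wlv.ac.uk/index.html", known_domains=wlv))   # True
print(qualifies("http://www.wlv.gov.uk/", known_domains=wlv))                 # False
print(qualifies("http://www.google.com/options/index.html",
                path_scope=".google.com/options/"))                           # True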

Web link extraction

Recall that web pages are coded in HTML, which is a coding language that tells web browsers what each web page should look like, including which links it should contain (Powell, 2003). URLs can be indicated in the text of web pages in the following ways, illustrated for the same URL.

• The anchor tag (including a client-side image map):
  o <A HREF="http://vivisimo.com/"></A>
• A meta tag in the head of the HTML indicating redirection to an alternative page:
  o <META HTTP-EQUIV="refresh" content="0;URL=http://vivisimo.com/">
• The frame tag:
  o <FRAME SRC="http://vivisimo.com/">

To illustrate the extraction of URLs from HTML, the following markup in the HTML of a web page...

<A HREF="http://www.db.dk/lb/">Dr Lennart Bjorneborn</A>,
Small world phenomena on the Web

...will produce the text below when the page is loaded into a web browser, including a clickable link to http://www.db.dk/lb/:

Dr Lennart Bjorneborn, Small world phenomena on the Web

SocSciBot will process the HTML, looking for the various different ways in which URLs can be specified. It will look for the code HREF= to specify the start of an URL and will be able to identify http://www.db.dk/lb/ as a link URL because of this. It will extract the URL from between the quotes in this example.
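
A minimal sketch of this extraction step, using Python's standard HTML parser rather than SocSciBot's own code, and covering the three HTML constructions listed above, might look like this:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collect URLs from anchor, frame and meta-refresh tags.
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.urls.append(attrs["href"])
        elif tag == "frame" and "src" in attrs:
            self.urls.append(attrs["src"])
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            content = attrs.get("content") or ""
            if "url=" in content.lower():
                self.urls.append(content.split("=", 1)[1])

extractor = LinkExtractor()
extractor.feed('<A HREF="http://www.db.dk/lb/">Dr Lennart Bjorneborn</A>')
print(extractor.urls)   # ['http://www.db.dk/lb/']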

URLs from HTTP

Outside of HTML, URLs can also be sent directly to browsers and crawlers via the HTTP mechanism. If a web page is requested that the server knows is somewhere else, then it may send redirection information, giving the new location of the page. The example below shows the result of requesting a page with URL http://www.db.dk/lb that the server knew was actually at the URL http://www.db.dk/lb/ with a slash on the end.


HTTP/1.1 301 Moved Permanently
Date: Fri, 04 Jun 2004 08:20:23 GMT
Location: http://www.db.dk/lb/

When surfing the web with a web browser, the browser would automatically request the page from its new location upon receipt of the redirection information. The user would not realize that a redirection had occurred unless they saw the URL automatically change. Redirected URLs are flagged as such in SocSciBot link structure files.

Obscured or unspecified URLs

URLs can also be accessed by web users in ways that are difficult or impossible for web crawlers. The following are examples of these.

• Embedded programs running through the browser, such as JavaScript, Java, Shockwave and Flash.

• Non-HTML document types with a hyperlinking capability, for example online MS Word or PDF documents.

• Server side image maps.

Server side image maps are pictures in web pages that form an URL from the coordinates of the point that a user clicks on. A program on the server returns a page based upon the coordinates sent. Although it is possible to crawl every possible pair of coordinates for any server side image map found, it is impractical to do this.

SocSciBot ignores all of the above. It is not possible to easily extract URLs from embedded programs since these may be built by the code itself when running, and so a decision was made to ignore all such links. This would mean that a site using this kind of technology without the backup of HTML links would not be covered completely.

Non-HTML documents can still be 'web pages' if they are delivered by a web server and have an URL. The decision not to attempt to code SocSciBot to extract links from any non-HTML documents was based upon the desire to avoid the programming complexity involved with supporting types less simple than HTML.

Server-generated pages

Some web pages have their URLs formed by user actions on the previous page. This is the case when a web page or information is requested by a method other than clicking on a link, such as filling in an electronic form in a web page. A simple example of this used on one university home page is a drop-down list of choices instead of a selection of links. The information chosen can be sent to the server either as part of the URL (a 'GET' request) or as information sent with the URL, but not forming part of it (a 'POST' request) (Powell, 2003). In the latter case many different pages can have the same URL. The same mechanism can send information typed by the user, for example a search engine keyword query. SocSciBot does not attempt to extract or predict URLs from server-generated pages requested in this way.


Dealing with errors

One aspect of a web crawler that is often only of interest to programmers is its algorithms for dealing with errors. There are essentially two kinds: Internet transfer errors and HTML errors. The former case encompasses all events that can prevent a complete web page from being received by the crawler, whereas the latter includes all mistakes in the web page designs.

Page Missing The most common Internet transfer error occurs when a web page is requested that does not exist. This can be identified when the HTTP header returned by the web server includes the code 404 for 'file not found'.

HTTP/1.1 404 Not Found
Date: Fri, 28 Nov 2004 23:06:39 GMT

A simple error message may also be sent in an accompanying HTML page. Such pages are flagged as missing by SocSciBot in its link structure file.

Server not responding Missing pages are an example of an error with a definitive cause, but other errors are impossible to fully diagnose automatically. For example, when a web server is switched off a request for one of its pages will not be returned with an error message: the request will just not be answered. The reason for a request not being answered within the default time period (60 seconds) could be that the server is offline permanently or temporarily, that it is very busy, or even that its part of the Internet is overloaded. Delays when requesting a page are a problem because a large crawl could take a long time if the individual pages are not fetched very quickly. A crawl of 10,000 pages would take under 3 hours with an average retrieval time of 1 second per page, but over a day at 10 seconds per page and a week at 60 seconds per page. A large crawler can fetch many pages simultaneously (known as multi-threading), so that response delays from servers do not greatly slow down the overall crawl rate. The sending of multiple requests to the same site is not desirable for an individual site crawler like SocSciBot because it may overload the web server being crawled. As a result, SocSciBot requests only one page at a time, but has a strategy for minimizing the delays caused by servers that are not responding. The technique is to log the domain name of each URL that cannot be retrieved, and then put the unretrieved URL at the back of the queue to be attempted again after all other URLs have been fetched. If the server is still not responding, then it is recorded as a dead server and no further URLs will be requested from it.

Markup Errors Although HTML is an official language with agreed rules for elements, including links, this does not stop web designers from making errors in their pages. Some errors are so common that web browsers automatically correct them. One example is forgetting to close quotes at the end of a tag. SocSciBot does not stick rigidly to official HTML, but attempts to correct all the errors that it can. For example, if it finds the end of a line in the middle of an URL, then it will assume that the designer has forgotten to put in quotes to close the URL and will assume that the URL finishes at the end of the line, since URLs cannot span multiple lines.
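
The retry strategy described above can be sketched as follows (illustrative names only, not SocSciBot's actual code): a URL that fails is pushed to the back of the queue, and a server that is still not responding on the second attempt is recorded as dead so that no further URLs are requested from it.

from collections import deque
from urllib.parse import urlsplit

def crawl(start_urls, fetch):
    # 'fetch' is assumed to be a function that returns page text or raises OSError.
    queue = deque(start_urls)
    failed_once, dead_servers, pages = set(), set(), {}
    while queue:
        url = queue.popleft()
        host = urlsplit(url).netloc
        if host in dead_servers:
            continue                      # server already recorded as dead
        try:
            pages[url] = fetch(url)
        except OSError:
            if url in failed_once:
                dead_servers.add(host)    # second failure: give up on this server
            else:
                failed_once.add(url)
                queue.append(url)         # retry after all other URLs have been attempted
    return pages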


Human intervention during crawls

It is impossible to run a web crawler fully automatically if its purpose is to cover an entire site. This is because web servers can create new pages upon request, including links to further new pages. A simple example of this is an online calendar linked to a database of activities. The address of each page in this example could be an encoded version of the date and it could contain links to the next day. This situation occurred during a SocSciBot crawl of one university, with the crawler eventually requesting pages from days during the year 2030 before human intervention stopped it. During large crawls, the URLs SocSciBot visits need to be checked to ensure that no infinite crawling is occurring. URLs causing loops can be added to a banned list of areas that the crawler is instructed to ignore.

It is important to realize that the use of the banned list changes site outlink counts. This is particularly relevant when mirror sites are banned. A commercial web crawler may be expected to take measures to avoid crawling multiple versions of large mirror sites, but the inevitable partial implementation of such a policy does create a source of uncertainty in their results. If mirror sites were left in SocSciBot crawl results, they may cause problems in subsequent link analyses. For instance it is common practice to put a link to the home page of the creating organization on all mirrored pages. This creates a large anomaly when counting links. Banning these pages may be justifiable on the grounds that the organization owning the hosting web site has not authored the work that it has mirrored.

SOCSCIBOT TOOLS

The software tools available with SocSciBot can process the link files the crawler creates. Many different types of analysis can be performed, including those below.

• Data cleansing to remove unwanted links.
• Extracting subsites (e.g. physics departments from whole university sites).
• Counting links between sites using the ADMs.
• Counting site inlinks using the ADMs.
• Reporting the most frequent link targets.
• Removing internal site links.
• Summarizing link targets by domain or top level domain.
• Calculating PageRank statistics.
• Calculating topological components (e.g. IN, OUT, CONNECTED).
• Finding the link diameter of collections of pages and longest shortest paths.
• Converting link structure files to Pajek and UCINET input files.

The advantage of SocSciBot Tools is the range of link analysis functions that it offers. It has two significant disadvantages. First, it is difficult to use. This is because some of the tasks it performs are complex with many options to be selected, and also because its interface and documentation have not been designed by professionals. A second drawback is that some of the analyses take a long time to perform: hours or days. This is partly due to the task complexity and partly the fault of inefficient programming. For those who master SocSciBot Tools, however, a world of advanced link analysis will open up.


SUMMARY

For simple link analysis, a commercial crawler that offers sufficient functionality would be a good choice. For advanced analyses, SocSciBot and SocSciBot Tools are recommended for the functionality of the tool set. For researchers wishing to go one step further and invent new types of analysis, a customizable open source web crawler is recommended.

ONLINE RESOURCES

The online component of this chapter contains links to SocSciBot and SocSciBot Tools, including their manuals. Links to other types of personal crawler are also given: http://linkanalysis.wlv.ac.uk/18.htm. Figure 18.2 shows the SocSciBot home page, as of July 2004, and figures 18.3 and 18.4 illustrate its associated software: SocSciBot Tools and Cyclist.

Table 18.1. Information available in the online component of this book.

Information/Use: SocSciBot link crawler program. To crawl web sites for link analysis purposes. This can be instead of using commercial search engines for link count statistics, for comparison with commercial search engine results, to obtain more controlled results, or to conduct more sophisticated subsequent analyses.
URLs/Examples: http://socscibot.wlv.ac.uk

Information/Use: SocSciBot Tools link analysis program. To conduct various link analyses on the web sites crawled by SocSciBot.
URLs/Examples: http://socscibot.wlv.ac.uk

Information/Use: Cyclist search engine for SocSciBot data. To search the web sites crawled by SocSciBot.
URLs/Examples: http://socscibot.wlv.ac.uk

Information/Use: Links to other personal crawlers. To select alternative crawlers, if SocSciBot is not appropriate for any reason.
URLs/Examples: The Harvest-NG open source web crawler: http://webharvest.sourceforge.net/ng/; The tucows collection of mainly shareware personal crawlers (offline browsers): http://www.tucows.com/offline95_default.html

FURTHER READING

The SocSciBot section of this chapter is adapted from a paper in the e-journal Cybermetrics (Thelwall, 2003). See the appendix of this book for a tutorial walkthrough of the capabilities of SocSciBot and its associated software.


Figure 18.2. A screen-shot of the SocSciBot home page.

REFERENCES

Arroyo, N. (2004). Evaluation of commercial and academic software for webometric purposes, WISER technical report. Available at: www.wiserweb.org.

Cothey, V. (2005, to appear). Web-crawling reliability. Journal of the American Society for Information Science and Technology.

Powell, T. (2003). HTML & XHTML: The complete reference. New York: McGraw-Hill Osborne Media.

Thelwall, M. (2003). A free database of university web links: Data collection issues. Cybermetrics, 6(1). Available: http://www.cindoc.csic.es/cybermetrics/articles/v6i1p2.html.


Figure 18.3. A screen-shot of SocSciBot Tools.

Figure 18.4. A screen-shot of Cyclist after a search for "link".


19

DATA CLEANSING

OBJECTIVES

• To review methods for removing anomalies from link data sets described in previous chapters.

• To introduce some new data cleansing methods.

INTRODUCTION

Data cleansing is an almost inevitable part of any investigation using a large quantity of data, whether for statistical or data mining purposes (Pyle, 1999). Data cleansing refers to operations designed to improve the usefulness of a data set for a particular function. It may involve removing individual data items that a human, automated or semi-automated process judges to be undesired, as well as choosing a perspective with which to analyze the data that minimizes the impact of anomalies. This chapter pulls together information related to data cleansing already discussed in previous chapters and introduces some additional techniques. It is an extension of chapter 3.

OVERVIEW OF DATA CLEANSING TECHNIQUES

The first data cleansing stage can occur during the crawl itself. For instance, SocSciBot automatically checks for and rejects duplicate pages, and has an automatic filter to exclude URLs containing a question mark. It also allows additional sets of URLs to be excluded by the operator during the crawl (<chapter 18). This manual filtering needs to be conducted with a clear theoretical perspective (<chapter 3) in order to judge the kinds of URL that should be excluded. For example, pages not created by a site's owners and pages automatically generated by electronic equipment, such as web server log file analysis programs, may be amongst the types to be ignored.

Once a crawl is complete, additional manual filtering may be needed to remove pages from the data set that are of undesired types but which were not noticed during the crawl. For large data sets, it will not be possible to visit every page crawled for a validity check, but automated techniques may be used to identify sets of URLs that have the largest impact on the data so that they can be manually checked. This is discussed in more detail below, in the anomaly identification section.

The final data cleansing stage is the choice of Alternative Document Model for counting links (<chapter 3, "choosing link counting strategies" section). As discussed in chapter 3, this can be achieved by correlation tests, rational argument or a special technique, TLD spectral analysis.

ANOMALY IDENTIFICATION

Although it may not be possible to check every page in a large data set, computer-aided checking can ensure that the most influential potential anomalies can be identified. Of course, the characterization of a page or set of pages as an anomaly depends upon the theoretical perspective.

The options available to support anomaly identification depend upon the data available. If it is possible to justify using a theoretically motivated mathematical model, then this may be used to identify for manual inspection the data that is furthest from the predictions of the model. For example, if counts of links between pairs of universities are expected to be proportional to the product of the universities' research productivities, then linear regression can be used to fit the link count data to this product and the Mahalanobis distance (Tabachnick & Fidell, 2001, p68) used to identify the greatest outliers for inspection. This will identify pairs of universities where the count of links between them is significantly higher or lower than would be expected from their research productivities. Manual inspection of the pages hosting the links may reveal a cause and lead either to additional URL filtering, removing the link count data for this pair of universities from the data set, or leaving the data unchanged if it is valid.
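
A sketch of this kind of screening is given below, with invented data. Instead of the full regression-based procedure described in Tabachnick & Fidell (2001), it simply computes the Mahalanobis distance of each pair of universities from the centre of the joint (productivity product, link count) data, which is one straightforward way to rank candidate outliers for manual inspection:

import numpy as np

# Hypothetical data: one row per pair of universities.
productivity_product = np.array([120.0, 80.0, 200.0, 150.0, 60.0, 300.0, 90.0, 210.0])
link_counts = np.array([35.0, 20.0, 55.0, 400.0, 18.0, 80.0, 26.0, 60.0])

data = np.column_stack([productivity_product, link_counts])
centre = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
diffs = data - centre

# Squared Mahalanobis distance of each pair from the centre of the data.
d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)

# Inspect the pairs with the largest distances first; here the pair with
# 400 links but only a middling productivity product stands out.
for index in np.argsort(d2)[::-1][:3]:
    print(index, round(float(d2[index]), 2))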

Without a mathematical linking model, an alternative data cleansing technique is to manually investigate the highest link counts, accepting that the lower link counts are also likely to contain anomalies but that these are probably less influential than the larger link count anomalies. This is potentially problematic because the selective strategy may influence the overall results and so should be used with care. There are several different types of link count that may be used to identify the highest linking values.

Site inlinks In a large-scale link analysis exercise, the sites with most inlinks may be investigated. In most cases this will not be desirable, because site inlinks are often the object of primary interest in an investigation and so investigating only the highest cases may be too direct an interference with the data. It is also difficult to do because the investigation of all inlinks for a single site may require visiting many different sites that host the inlinks.

Page inlinks A ranked list of the highest inlinked pages either inside the data set (i.e. in one of the sites crawled) or outside the data set (i.e. anywhere in the web) is a good way to identify potential anomalies (<chapter 13). Investigating why each of, say, the top 100 inlinked pages is highly inlinked may reveal anomalies that can be filtered out. One type of anomaly that may be identified by this approach is the mirror site, particularly mirror sites that contain an acknowledgement link to the source site on each page. Directory ADM and domain ADM inlink-ranked lists can also be used.


Inter-site link counts A ranked list of pairs of sites by the count of links between them (using any ADM) gives a relatively easy-to-investigate data source because each anomaly will have only one source university web site, which makes finding the link source pages easier.

A data cleansing exercise may use any or all of the above approaches and the investigations will clearly be time consuming. The end result will be higher quality, more extensively filtered data that should yield better results.

TLD SPECTRAL ANALYSIS

In addition to anomaly identification, the choice of ADM is part of data cleansing. Correlation techniques for this are discussed in chapter 3, but in this chapter an alternative technique is discussed, TLD spectral analysis. This is a technique designed to choose the ADM that is most appropriate to analyze a given set of web sites. TLD spectral analysis can be used when there is no external source of data for correlation tests. When such external data is available, the chapter 3 techniques should be used instead.

Essentially, TLD spectral analysis compares the distribution of top-level domains targeted by the sites crawled using each of the ADMs and selects as best the ADM that produces the least variation in top level domain (TLD) distributions. There are three basic assumptions underpinning the method, collectively known as the independent TLD target distribution model (Thelwall, 2005).

1. Each web site is constructed from a finite collection of "documents", which are not necessarily web pages.

2. Site outlink TLDs obey a common probability distribution across the documents of all the web sites.

3. Site outlink TLDs are statistically independent of each other.

The importance of the first point is that the documents comprising web sites may correspond to web pages, domains, directories or sites, but this is not known in advance. If the above statements are true for any given collection of web sites and if the documents in the sites are correctly identified, then the proportion of outlinks from each web site that target each TLD should be approximately the same. To test how well an ADM fits a set of web sites, these proportions can be calculated for each site and then the variation of the proportions between the sites calculated. The ADM for which these proportions vary least (e.g. using standard deviations) is chosen as being the document model that best fits the three assumptions above. See Thelwall (2005) for full details.
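As a rough illustration of the calculation, the sketch below (Python, with a simplistic tld() helper that is an assumption of the illustration rather than part of the method) computes, for one candidate ADM, the per-site TLD proportions and the average standard deviation of those proportions across sites; running it once per ADM and choosing the ADM with the smallest value follows the selection rule described above.

from collections import Counter
import statistics

def tld(url):
    # Crude assumption: the TLD is whatever follows the last dot in the host part.
    host = url.split('/')[0]
    return host.rsplit('.', 1)[-1]

def tld_variation(sites_outlinks):
    """sites_outlinks: one list of outlink URLs per site (aggregated under one ADM).
    Returns the mean, over TLDs, of the standard deviation of the per-site
    proportions of outlinks targeting that TLD - lower suggests a better ADM fit."""
    proportions = []
    for outlinks in sites_outlinks:
        counts = Counter(tld(u) for u in outlinks)
        total = sum(counts.values()) or 1
        proportions.append({t: c / total for t, c in counts.items()})
    all_tlds = set().union(*proportions)
    spreads = [statistics.pstdev([p.get(t, 0.0) for p in proportions]) for t in all_tlds]
    return statistics.mean(spreads)

# Compute tld_variation once per candidate ADM and select the ADM with the lowest value.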

SUMMARY

This chapter has described a range of data cleansing techniques. The ADMs are a global approach to count links in a way that gives less scope for anomalies. The other techniques identify the most influential anomalies for manual inspection. The importance of effective data cleansing should not be underestimated: in the related area of data mining it has been suggested that data cleansing is likely to consume a significant proportion of all the time taken to analyze data (Pyle, 1999).

Care should be taken to ensure that anomaly removal does not degenerate into eliminating all awkward data points: a theoretical justification is needed to support removal decisions, otherwise the validity of the cleansed data for addressing the research questions is compromised.

ONLINE RESOURCES

The online component of this chapter contains information about data cleansing with SocSciBot Tools http://linkanalysis.wlv.ac.uk/19.htm.

Table 19.1. Information available in the online component of this book.

Information: Instructions for data cleansing with SocSciBot Tools.
Use: For use when conducting any link analysis investigation with SocSciBot.

REFERENCES

Pyle, D. (1999). Data preparation for data mining. San Francisco, CA: Morgan Kaufmann.
Tabachnick, B. & Fidell, L. (2001). Using multivariate statistics, 4th edition. Needham Heights, MA: Allyn and Bacon.
Thelwall, M. (2005). Data cleansing and validation for Multiple Site Link Structure Analysis. In: Scime, A. (Ed.), Web Mining: Applications and Techniques. Idea Group Inc., pp. 208-227.


20

ONLINE UNIVERSITY LINK DATABASES

OBJECTIVE

• To introduce the cybermetrics free online link databases.

INTRODUCTION

The cybermetrics free online databases are a large collection of files recording the link structures of university web sites, as crawled by a version of SocSciBot (http://cybermetrics.wlv.ac.uk/database/). The main difference from the standard version is that the variant of SocSciBot used to create the cybermetrics databases has a larger limit to the number of URLs it can download in a single crawl: 900,000 at the time of writing. The databases are normally placed online within a few days of the end of the crawl. The purpose of this free online resource is twofold.

• To make link data available to link analysis researchers or students for practice.
• To make link data available to link analysis researchers for research.

The databases provide quick access to a large and rich source of data for link analysis so that researchers can experiment with the techniques described in the book without having to conduct an extensive crawling exercise. The databases can also be used by anyone for published research.

OVERVIEW OF THE LINK DATABASES

The databases available in the site each represent a systematic crawl of university web sites in a single country over a period of up to three months. For example, database 15 contains the link structure of 38 Australian universities, crawled in February 2004. This is the result of 38 separate crawls, one for each university. The choice of universities to crawl in a country is not always simple for the following reasons.


• Not all universities contain the word 'university' in their title.
• The status of individual institutions or whole classes of institution can change.
• Some countries, such as the USA, do not have a special legal status for universities, whereas others have a tiered system of higher education that does not fit neatly into a binary university/non-university divide.

Australia is an example of a relatively straightforward binary divide between university and non-university institutions. Nevertheless, there are some non-standard university names, such as the Royal Melbourne Institute of Technology, and also changes. Between 2003 and 2004 a new university was created: Charles Darwin University.

The country coverage of the cybermetrics database is not systematic: the UK, Australia and New Zealand are represented by annual crawls, but Spain, Taiwan and mainland China had only a single crawl and other countries are not represented (in 2004). This partial coverage is because crawling university web sites is time-consuming and hence university web sites have only been crawled to address specific research questions. The USA is a conspicuous omission: a result of its large number of universities.

Each link database is accompanied by a list of the domain name(s) of the universities crawled. Most universities have just one official domain name ending. Even though there may be many domain names, they all tend to end in a common university identifying part, such as indiana.edu. Some universities have more than one domain name, however, and multiple domain names are recorded in a separate file. The knowledge of these is important for link counting, and SocSciBot Tools needs to know about multiple domain names to correctly process the link structure files. Wolverhampton University is an example of a university with multiple names, a long and a short version: wlv.ac.uk and wolverhampton.ac.uk.

LINK STRUCTURE FILES

Each database is a set of plain text files, one for each university, combined into a single zip file. Depending upon the country, this varies in size from tens of megabytes to hundreds of megabytes. Each university text file records only the link structure of the web site and not its text or other information. A simple shorthand is used for URLs to save space, as follows. The initial "http://" at the start of an URL is removed and, for any URL beginning with "http://www.", the "www" is also removed (but not the dot). The examples below show two URLs in the full and then the shorthand form.

Full URLs:
http://www.indiana.edu/~tisj/index.html
http://mail.asis.org/pipermail/eurchap/

Shortened URLs:
.indiana.edu/~tisj/index.html
mail.asis.org/pipermail/eurchap/
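A small helper like the one below (an assumed illustration, not part of SocSciBot) reproduces this shorthand rule in Python.

def shorten(url):
    # Drop the leading "http://"; for "http://www." URLs also drop the "www"
    # but keep the dot, matching the examples above.
    if url.startswith('http://www.'):
        return url[len('http://www'):]
    if url.startswith('http://'):
        return url[len('http://'):]
    return url

print(shorten('http://www.indiana.edu/~tisj/index.html'))   # .indiana.edu/~tisj/index.html
print(shorten('http://mail.asis.org/pipermail/eurchap/'))   # mail.asis.org/pipermail/eurchap/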

The structure of each file is a list of link URLs extracted from a page, followed by the page URL. Links are identified by being indented (with a tab) but page URLs are not indented. A tab and then one of the codes 1, 2, 3 or 5 follow the URL of each crawled page.


1  A valid HTML page.
2  An error occurred when retrieving the page that could not be resolved.
3  A redirection of the URL. This occurs when the web server sends a message to the web browser indicating that the URL has moved to another location. This is a HyperText Transfer Protocol (HTTP) command (<chapter 18).
5  A valid non-HTML document, such as an image or a Microsoft Word document. No links are ever extracted from non-HTML documents.

The following example illustrates two pages, the first an HTML page with two links, and the second a valid non-HTML page with no links. An HTTP redirection request is recorded following these and then a fatal error.

    .indiana.edu/~tisj/readers/special.html
    .indiana.edu/~tisj/readers/topics.html
.indiana.edu/~tisj/index.html    1
.iuf.indiana.edu/report/IUFreport.pdf    5
    .indiana.edu/~tisj/
.indiana.edu/~tisj    3
.indiana.edu/~tisj/home.htm    2
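For readers who prefer to process the files with their own code, the following Python sketch reads one file in the format just described; it is an assumed illustration of the layout above, not SocSciBot's own parser, and it glosses over encoding and malformed-line details.

def read_link_file(path):
    """Yield (page_url, status_code, [link_urls]) for each crawled page record:
    indented lines are link targets, an unindented line is the page URL
    followed by a tab and one of the codes 1, 2, 3 or 5."""
    pending_links = []
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            if line[:1].isspace():                 # an outlink belonging to the next page record
                pending_links.append(line.strip())
            else:                                  # a crawled page URL, tab, status code
                url, _, code = line.rpartition('\t')
                yield url, code.strip(), pending_links
                pending_links = []

# Example: outlink counts per valid HTML page (code '1') in one university file.
# counts = {page: len(links) for page, code, links in read_link_file('uni.txt') if code == '1'}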

It is important to remember that each university web site link structure file will be an incomplete representation, typically covering only the publicly indexable set, and excluding both duplicate pages and pages with URLs matching the banned list. The fact that it is still meaningful to study these sets is supported by statistical correlation tests (<chapter 8).

THE BANNED LISTS

Each link database is accompanied by a 'banned list' of URLs that were excluded from the crawl through being a mirror site or because they were another unwanted page type (<chapter 2). Below is an extract from one of the New Zealand banned lists.

[waikato.ac.nz]
.cs.waikato.ac.nz/~jcleary/230/jdkdocs
.cs.waikato.ac.nz/~remco/data/
.cs.waikato.ac.nz/~syeates/bin

At the head of each list in square brackets is the domain name of the university with which the list is associated. Underneath is a list of URLs that the crawler has been instructed to ignore. The URLs are given in shorthand form, and an URL is ignored if it matches any in the list. The match does not have to be a full match; a shorthand URL is ignored if it matches the full length of the banned shorthand URL. To illustrate this matching process, the shorthand URL .cs.waikato.ac.nz/~remco/data/setl.htm would be ignored because it matches the whole of the second shorthand URL in the list above, even though it is longer. However, the shorthand URL .cs.waikato.ac.nz/~remco/ would not be ignored because it does not fully match any of the shorthand URLs in the list.
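The matching rule can be expressed in a couple of lines of Python; the helper below is an assumed illustration of the rule, reusing the Waikato entries above.

def is_banned(shorthand_url, banned_entries):
    # A URL is ignored when it starts with the full text of any banned entry.
    return any(shorthand_url.startswith(entry) for entry in banned_entries)

banned = ['.cs.waikato.ac.nz/~remco/data/']
print(is_banned('.cs.waikato.ac.nz/~remco/data/setl.htm', banned))  # True
print(is_banned('.cs.waikato.ac.nz/~remco/', banned))               # False: shorter than the entry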


ANALYZING THE DATA

The link structure files provided in the Cybermetrics database can be analyzed in various different ways using the SocSciBot Tools software (<chapter 18) that also analyses the link structure files created directly by SocSciBot, because the formats are the same. Instructions for carrying out individual analyses are given online so that the software can retain the flexibility to adapt and change. As a general rule it is worth noting, however, that some of the analyses require a lot of hard disk space to store the information created and may take a long time to complete. All are welcome to write their own computer programs to analyze the data.

OTHER LINK STRUCTURE DATABASES

There do not appear to be any other free link structure databases on the web. There are, however, large coherent collections of web pages online, from which programs may extract link structures. The TREC (http://trec.nist.gov/) web data sets are currently perhaps the best known, at least in computer science.

SUMMARY

The cybermetrics database and its tools provide a free online resource for students and researchers to access and use. Its principal disadvantages are the large size of the data sets, the awkwardness of the software tools and the limited number of countries crawled. It is hoped that this provides a valuable complement to the theory presented in this book and may allow researchers to investigate further into the link structure of academic webs.

ONLINE RESOURCES

The online component of this chapter contains information about known link structure databases on the web, as well as links to the cybermetrics data set and tools and instructions for processing it http://linkanalysis.wlv.ac.uk/20.htm. Table 20.1 summarizes the types of information available, and figures 20.1 and 20.2 illustrate sections of the Cybermetrics database site.

FURTHER READING

This chapter is partly based upon an article published about the database (Thelwall, 2003). See also the sections on SocSciBot in this book.


Table 20.1. Information available in the online component of this book.

Information/Use: Link to the cybermetrics database. To use to conduct large scale or longitudinal link analyses without having to crawl many large university webs. Also appropriate for use in teaching so that students do not have to crawl for their own data.
URL: http://cybermetrics.wlv.ac.uk/database/

Information/Use: SocSciBot Tools and some instructions and a tutorial for using SocSciBot Tools with the cybermetrics data set. For processing the cybermetrics data for various kinds of link analysis.
URLs: http://cybermetrics.wlv.ac.uk/database/ and http://socscibot.wlv.ac.uk/

Information/Use: Links to any other relevant link analysis databases. For use when cybermetrics data is not appropriate, or for comparison purposes.
URLs: None found yet (July, 2004)

Figure 20.1. The Cybermetrics database home page (top).


Figure 20.2. The Cybermetrics database home page (middle).

REFERENCE

Thelwall, M. (2003). A free database of university Web links: Data collection issues. Cybermetrics, 6. Available: http://www.cindoc.csic.es/cybermetrics/articles/v6i1p2.html


21

EMBEDDED LINK ANALYSIS METHODOLOGIES

OBJECTIVES

• To demonstrate the importance of embedding link analysis in a wider theoretical framework to effectively address social issues.

• To review two methodologies that include an embedded link analysis: web sphere analysis and virtual ethnography.

INTRODUCTION

A number of research methods have evolved from a social science perspective that analyze links by embedding them in specific social contexts. The link analysis method of this book uses aspects of social context as an essential ingredient (e.g. correlation tests and link categorization), but the differences are in the less central role of links in the social science methods and in typically much smaller scale studies. In this chapter, two developed social science methodologies are reviewed. The first, web sphere analysis (WSA), can have "making sense of linking practices" (Foot, Schneider, Dougherty, et al., 2003) as a key objective. It is very close to some of the techniques described in this book, but analyzes the dynamic nature of linking practices within a specified overall framework. Hine's (2000) virtual ethnography is a large stride further away from the direct analysis of links and towards the individual social contexts of authors. The object of study is human behavior around an issue for which online activities are important. Web publishing is one potential online action, and hence link creation, but links are not accorded a special status. Hine's methodology is not, therefore, a type of link analysis, but is a link-related methodology that is useful for a more holistic picture of the role of link creation within society.


WEB SPHERE ANALYSIS

Web sphere analysis is an approach to studying online behavior around a specific topic that can be applied to links as a central object of study or to links in conjunction with other web phenomena. The key aspects of WSA are as follows.

• Topic-centered A web sphere is a collection of sites relevant to a theme, event or concept. The pages may be found using various sources, such as Google searches and link lists. A large site may be only partly in the web sphere if some of its content is not relevant.

• Dynamic The boundaries of the web sphere can change over time as new sites are found and added, whilst others are excluded as they die or become irrelevant. This is very desirable for tracking a fast moving topic because newly emerging phenomena that began after the start of the project can still be identified. Dynamic data sets are more troublesome to analyze, however.

• Classification-based One quantitative aspect of web sphere analysis is content analysis-style link classifications. Other, non-link types of investigation can also be applied to a web sphere. In principle any type of analysis could be applied: the purpose of the web sphere is to define the boundaries of the set of sites to be studied (Schneider & Foot, 2004).

• Longitudinal Identifying changes or evolution over time is a key goal. For example, discovering changes in linking practices over the course of a study is a suitable research question.

A key requirement for the web sphere is the ability to find the sites that should be part of it. This rests in part upon the imagination of the researchers to conceive of types of web site that are likely to be relevant and to find these sites on the web.

An example of a web sphere analysis is one constructed for the 2002 US election candidates (Foot, Schneider, Dougherty et al., 2003). This tracked the evolution of congressional candidates' individual web sites over a three-month period. Linking practices were found to be very varied and not genre-bound, with linking often being a form of recognition. Hyperlinking did not appear to have become enough of an important issue for there to be a perceived need for standardization, e.g. within a single political party. Increased standardization in web publication, as the importance of the web is recognized for elections, seems to be a logical eventual development.

Web sphere analysis is particularly suited to fast moving events. It has features in common with departmental level link analysis studies (<chapter 10), principally the focus on sites around a topic, but the departmental link analysis studies have tended to be static rather than longitudinal.

VIRTUAL ETHNOGRAPHY

Ethnography is a qualitative methodology, centered on a researcher immersing herself in a situation to be studied and exploring, amongst other things, relationships between people and the perceptions of the individuals involved. An attraction of the ethnographic approach is its ability to generate deep and culture-sensitive insights into a particular situation. Its drawback is the necessary localization to a single setting. Ethnographic studies can always be accused of not being generalizable, or of choosing a situation that is in some way untypical. Similarly, larger scale studies can always be accused of being oversimplifications of a complex situation. Ethnography is a valuable tool for its insights, and critics can test these insights through more generalizable follow-up studies.

Hine's (2000) virtual ethnography is a loose framework for web research. Ethnography is, in its nature, culture and situation-specific and so does not have a prescriptive set of methods, although ethnographers can share methods and learn from each other's experiences. The main tenet of virtual ethnography is a holistic approach. Everything relevant to the situation studied may be included in the investigation. For example, Hine has investigated how the Internet was used by people interested in a particular media event: the trial in the USA of the British nanny Louise Woodward. Her investigation, which took place during the main events of the trial, included web pages, newsgroup posts and direct email exchanges with people identified as relevant Internet users. The analysis took into account other media, such as television reporting, because these were common sources of information for those posting online. Questions asked through email were deliberately open-ended in order to get the respondent's perspective. To illustrate the results, individual webmasters of sites relating to Louise Woodward variously commented that: their chosen web design style was influenced by their HTML knowledge, and time available to learn new software, or by the style of writing used in their job; their site was created because of a desire to do something and the availability of Internet publishing software; feedback from visitors was a powerful motivation to maintain the site; they had thought very carefully about the site layout and its potential impact on visitors; they guessed what potential visitors would want to see. Most webmasters commented on the impact of various time constraints on their ability to keep their site up to date.

Links are one aspect of Hine's (2000, pp. 105-108) study and give interesting results. Perhaps partly because the investigation could only analyze web sites that could be found by an investigator, their webmasters tended to have conscious marketing strategies for their sites. These included placing their site URL in their emails in addition to offline URL distribution and registration with search engines. Links from other sites were also seen as sources of new visitors, yet most amateur support sites prominently linked to the unofficial main support site. This is an interesting phenomenon since it would presumably drain visitors from the unofficial site. Presumably the link to the main support site was to confirm the genuine intentions of the source site or in the belief that the main site would contain additional useful information, indicating a genuine (if partisan) desire to inform the visitor.

Virtual ethnography offers the potential to gain insights about linking motivations, which, as yet, can only be inferred from the results of link classification studies. A virtual ethnography with an academic link related theme would be very welcome.

SUMMARY

Web sphere analysis and virtual ethnography are both general research frameworks in which links are one potential object of study. Both are topic-centered, but virtual ethnography is oriented towards analyzing social interactions relating to online behavior. WSA emphasizes the dynamic nature of the web, allowing the set of web sites studied to change during the time scale of the research, and discussing evolving practices. Virtual ethnography places the individual centre-stage and explores their motivations and influences.


FURTHER READING

Web sphere analysis is an evolving approach, but Foot, Schneider, Dougherty et al. (2003) is a good article to read for its focus on links. A more detailed discussion of the methodology itself can be found in Schneider & Foot (2004). Virtual ethnography is described and illustrated in Hine's (2000) book. Her subsequent edited volume (Hine, 2004) is a good place to look for a variety of Internet-related social science methods. The chapter by Jankowski and van Selm (2004) is welcome for its insistence on the value of applying existing tried-and-tested social science research methods to the Internet, rather than creating new methods simply because there is a new object of study. For an alternative sociological perspective on the web, Burnett and Marshall's (2003) book is an interesting read.

The definitive work on academic use of the Internet, drawn from a mainly qualitative perspective, is Nentwich's (2003) Cyberscience. This is useful for the range of methods used, in addition to the vast array of findings reported.

REFERENCES

Burnett, R. & Marshall, P. (2003). Web theory: An introduction. New York: Routledge.
Foot, K., Schneider, S., Dougherty, M., Xenos, M. & Larsen, E. (2003). Analyzing linking practices: Candidate sites in the 2002 US electoral web sphere. Journal of Computer-Mediated Communication, 8(4). http://www.ascusc.org/jcmc/vol8/issue4/foot.html
Hine, C. (2000). Virtual Ethnography. London: Sage.
Hine, C. (Ed.) (2004). Virtual Methods: Issues in Social Research on the Internet. Oxford: Berg.
Jankowski, N. & van Selm, M. (2004, to appear). Epilogue: Methodological concerns and innovations in Internet research. In: Hine, C. (Ed.), Virtual Methods: Issues in Social Research on the Internet. Oxford: Berg.
Nentwich, M. (2003). Cyberscience: Research in the age of the Internet. Vienna: Austrian Academy of Sciences Press.
Schneider, S. & Foot, K. (2004, to appear). Web sphere analysis: An approach to studying online action. In: Hine, C. (Ed.), Virtual Methods: Issues in Social Research on the Internet. Oxford: Berg.


22

SOCIAL NETWORK ANALYSIS

OBJECTIVES

• To introduce and define some basic network measures and techniques that originate in social network analysis (SNA).

• To describe the functionality offered by SNA software packages.

INTRODUCTION

Social network analysis is a methodology that has evolved to study social groupings, particularly in terms of social and communication connections within a group. The 'nodes' of an SNA investigation may be individual people, with connections recorded as links between these nodes, forming a network or graph that is mathematically equivalent to any other kind of network, including one of web pages and links. Within SNA a set of standard measures of aspects of social networks has evolved, many of which can be transferred to the web. In recognition of this, the field of hyperlink network analysis (HNA) (Park, 2003) has evolved to apply SNA methods to the web. This has the advantage that SNA brings a pre-existing set of techniques for studying networks, together with software to perform the necessary calculations. SNA has evolved with a set of assumptions about human interactions that do not directly transfer to web pages, however (Park & Thelwall, 2003). For example, SNA network connections may be communication channels such as phone conversations or personal friendships, whereas hyperlinks do not necessarily imply any exchange of information between the authors of the source and target pages of the link. Hence, each SNA technique must be re-evaluated to be used in a web context.

Although in principle SNA and HNA can be applied to networks of any size, they are more suited to the analysis of smaller networks. This is because they are closely tied to the meaning of the individual connections in a network, and in larger networks (and particularly in the web) the range of meanings and types of connection will be naturally more diverse.


SOME SNA METRICS

In this section, a selection of common SNA metrics is defined and described. There is a very large collection of SNA metrics in existence so this list is necessarily partial. It is also restricted to those that seem most applicable to the web.

Two preliminary definitions are necessary before introducing the first metric. A path between two nodes in a network is a contiguous chain of links, starting at the first and ending at the last. A shortest path between two nodes is a path between them that has the minimum possible length.

Betweenness centrality is a measure of how important a node in a network is for connecting other nodes in a network. The betweenness centrality of a node in a network is the probability that the node will occur in a shortest path between two nodes in a network (Freeman, 1977 from Bjorneborn, 2004). This notion does not automatically transfer to the web as meaningful, unless the concept of shortest path traversal is important for the research context. In many situations the ability of web users to follow paths of links would be irrelevant, and so it would not make sense to use betweenness centrality. It is clearly useful for navigation-related questions, however. There are many other SNA metrics that apply only to investigations that involve tracking paths through data, such as measures of closeness centrality, information centrality and influence centrality. Betweenness centrality has been used by Bjorneborn (2004).
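For readers who want to experiment, the sketch below uses the Python networkx library (a library choice that is an assumption of this illustration, not part of the book's own calculations) to rank the nodes of a tiny hypothetical directed link network by betweenness centrality.

import networkx as nx

G = nx.DiGraph()
G.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'D'), ('A', 'C'), ('B', 'D')])
bc = nx.betweenness_centrality(G)   # fraction of shortest paths that pass through each node
for node, score in sorted(bc.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))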

Freeman's (1977) degree centrality is an assessment of how central a node is to a network. There are three types of degree centrality. The indegree centrality of a node is its inlink count. The outdegree centrality of a node is its outlink count, and the symmetric degree centrality of a node is the sum of its inlink and outlink counts. These statistics do not really give an advantage over the more direct names of 'inlink counts' and 'outlink counts' but in SNA they are often used to rank nodes, which is a useful simple technique. Degree centrality has been used by Park, Barnett and Nam (2002).
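In code, the three variants reduce to simple link counts; the following networkx sketch (hypothetical network, illustration only) prints all three for each node.

import networkx as nx

G = nx.DiGraph([('A', 'B'), ('B', 'C'), ('C', 'A'), ('A', 'C')])
for node in G:
    indeg, outdeg = G.in_degree(node), G.out_degree(node)
    print(node, 'indegree =', indeg, 'outdegree =', outdeg, 'symmetric degree =', indeg + outdeg)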

A clique in a network is a collection of nodes that all link to each other. Cliques of size greater than 3 are probably quite rare in the web but are much more common in social networks, where groups of friends that know each other can be frequently found. Cliques are usually calculated without using the direction of links. Weaker notions of cliques are sometimes used, where each node must be able to connect to the other nodes through a path of length up to n, an n-clique (Luce, 1950). A related idea is Rousseau's Escher staircase, which is a set of four nodes that are arranged in a cycle through reciprocal links, where direction is important (Rousseau & Thelwall, 2004). See Figure 22.1 for examples of these three types of network phenomena. In each case the examples may be extracted from much larger networks.

Figure 22.1. A clique, a 2-clique and an Escher staircase.

A k-core is a subset of all the nodes in a network such that each node is linked to at least k nodes in the same subset (Seidman, 1983). A k-core is a highly interlinked collection of nodes within a larger network. The k-core is a relatively arbitrary measure since there is no natural choice for k. Nevertheless, it is often useful to split a large network into more coherent subnetworks or to identify subnetworks that appear to be highly related, and the k-core heuristic appears to be reasonable for this purpose. There are very many link-based node-clustering algorithms, but the k-core has the advantage of simplicity of explanation. K-cores have been used by Bjorneborn (2004).
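A k-core can be extracted with standard graph software; the sketch below uses networkx on a hypothetical undirected network (illustration only).

import networkx as nx

G = nx.Graph([('A', 'B'), ('B', 'C'), ('C', 'A'), ('C', 'D'), ('D', 'E')])
core = nx.k_core(G, k=2)       # maximal subnetwork in which every node has at least 2 links inside it
print(sorted(core.nodes()))    # ['A', 'B', 'C']: the triangle survives, the tail does not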

Quadratic Assignment Procedure (QAP) correlation is a measure to compare the similarity of two networks (Krackhardt, 1988). It has been used to test whether hyperlink networks are significantly similar to other networks involving the owners of the web sites. For example, QAP correlation could be used to test for a significant relationship between geographic distance between universities and counts of links between their web sites. Standard correlation techniques should not be used to compare matrices representing networks because the data in rows and in columns is typically related, leading to a biased test (Krackhardt, 1988). The QAP procedure seeks to avoid this by a bootstrapping approach: comparing the similarity of the matrix pair with their similarity after random permutations of the rows and columns of one of them. The QAP correlation reports the proportion of these permutations that produce less similar pairs of networks. QAP correlation has been used by Tang and Thelwall (2003).
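The following Python sketch shows the general shape of a QAP-style permutation test; it is a simplified illustration under stated assumptions (square matrices, diagonal ignored, the same permutation applied to rows and columns), not UCINET's implementation.

import numpy as np

def qap_correlation(a, b, permutations=1000, seed=0):
    """Return the observed correlation between two square matrices and the proportion
    of row/column permutations of the first matrix that correlate at least as strongly."""
    rng = np.random.default_rng(seed)
    mask = ~np.eye(a.shape[0], dtype=bool)            # ignore the diagonal cells
    observed = np.corrcoef(a[mask], b[mask])[0, 1]
    at_least_as_strong = 0
    for _ in range(permutations):
        p = rng.permutation(a.shape[0])
        permuted = a[p][:, p]                          # permute rows and columns together
        if np.corrcoef(permuted[mask], b[mask])[0, 1] >= observed:
            at_least_as_strong += 1
    return observed, at_least_as_strong / permutations

# links = inter-university link count matrix; distances = geographic distance matrix (both hypothetical)
# r, p = qap_correlation(links, distances)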

Clustering coefficients are not strictly part of SNA but are metrics that are also potentially useful to analyze web networks, and particularly to compare two networks. These measure the tendency of the nodes in a network to cluster. For example, one simple cluster measure (Watts & Strogatz, 1998) for a single node is to calculate the proportion of nodes linked to it that are linked to each other, sometimes called the clustering coefficient of the node. This measures how well interconnected a node's neighbors are. Averaging these proportions over all nodes in a network gives a measure of the clustering tendency of a network, known as the network clustering coefficient. Network clustering coefficients for different networks can be compared to assess their relative degrees of clustering. Bjorneborn (2004), and Watts and Strogatz (1998) have used this approach to investigate the properties of small-world networks.
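Both the node and the network clustering coefficients are available in standard libraries; the sketch below (networkx, hypothetical network) illustrates the two quantities just defined.

import networkx as nx

G = nx.Graph([('A', 'B'), ('B', 'C'), ('C', 'A'), ('C', 'D')])
print(nx.clustering(G, 'C'))        # proportion of C's neighbour pairs that are themselves linked
print(nx.average_clustering(G))     # the network clustering coefficient (average over all nodes)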

SOFTWARE

The dominant SNA software package at the time of writing was UCINET (http://www.analytictech.com/). This program, available for a small charge, can carry out a wide range of SNA procedures on a set of network data, including data exported by SocSciBot Tools in UCINET format. UCINET's documentation is extensive and references the sources of calculations performed with academic papers in which they are described. It is therefore perfect for academic SNA and HNA analyses, even for non-specialists. With UCINET, it is possible to load a data set and then experiment with the different options available. This is recommended as a valuable learning experience, not only to learn the software, but also to learn about the techniques. UCINET also runs some standard statistical techniques such as multidimensional scaling. As with all complex software, it will not be straightforward to use at first, but its features are likely to repay the effort of learning.

It seems unlikely that any other software package will replace UCINET in the near future simply because of its range of functions. There are other tools available, however, that offer overlaps in functionality with UCINET. For example, some network visualization programs, such as the Slovenian program for large network analysis 'Pajek', can also offer a range of network-based statistics, although not necessarily derived from SNA.

SUMMARY

Social network analysis has been described here as merely a set of tools and metrics for the quantitative evaluation of networks. In reality it is a very mixed quantitative-qualitative field, but the qualitative theory will not translate to most web networks because they are not social or communication networks. SNA has produced a wide variety of quantitative techniques but they should be used carefully because in many cases their meaning is dependent upon the network analyzed being one in which links are communication connections, allowing information to be transferred along link paths.

FURTHER READING

For those who wish to know more about SNA, an introduction is provided by Otte and Rousseau (2002) for information scientists. The reference manual for the program UCINET (Borgatti, Everett & Freeman, 2002) is a goldmine of information about social network analysis, and is very useful as a reference source.

An interesting and extensive theoretical discussion and case study of SNA techniques applied to the web can be found in Bjorneborn's (2004) Ph.D. thesis. A further SNA-hyperlink network analysis investigation is that of Garrido and Halavais (2003). A review of hyperlink network analysis research can be found in Park & Thelwall (2003).

REFERENCES

Bjorneborn, L. (2004). Small-world link structures across an academic web space: A library and information science approach. PhD Thesis. Royal School of Library and Information Science, Copenhagen, Denmark.
Borgatti, S.P., Everett, M.G. & Freeman, L.C. (2002). Ucinet for Windows: Software for social network analysis. Harvard: Analytic Technologies.
Freeman, L. (1977). A set of measures of centrality based on betweenness. Sociometry, 40, 35-41.
Garrido, M. & Halavais, A. (2003). Mapping networks of support for the Zapatista movement: Applying social network analysis to study contemporary social movements. In: M. McCaughey & M. Ayers (Eds.), Cyberactivism: Online activism in theory and practice. New York: Routledge, pp. 165-184.
Krackhardt, D. (1988). Predicting with networks: Nonparametric multiple regression analysis of dyadic data. Social Networks, 10, 359-382.
Luce, R. (1950). Connectivity and generalized n-cliques in sociometric group structure. Psychometrika, 15, 169-190.
Otte, E. & Rousseau, R. (2002). Social network analysis: A powerful strategy, also for the information sciences. Journal of Information Science, 28(6), 441-454.
Park, H.W., Barnett, G.A. & Nam, I. (2002). Hyperlink-affiliation network structure of top web sites: Examining affiliates with hyperlink in Korea. Journal of the American Society for Information Science and Technology, 53(7), 592-601.
Park, H.W. & Thelwall, M. (2003). Hyperlink analysis: Between networks and indicators. Journal of Computer-Mediated Communication, 8(4). http://www.ascusc.org/jcmc/vol8/issue4/park.html
Park, H.W. (2003). What is hyperlink network analysis?: New method for the study of social structure on the Web. Connections, 25(1), 49-61. Available: http://www.sfu.ca/~insna/Connections-Web/Volume25-1/7.Hyperlink.pdf
Rousseau, R. & Thelwall, M. (2004). Escher staircases on the world wide web. First Monday, 9(6). http://www.firstmonday.org/issues/issue9_6/rousseau/index.html
Seidman, S. (1983). Network structure and minimum degree. Social Networks, 5, 269-287.
Tang, R. & Thelwall, M. (2003). Disciplinary differences in US academic departmental web site interlinking. Library & Information Science Research, 25(4), 437-458.
Watts, D.J. & Strogatz, S.H. (1998). Collective dynamics of 'small-world' networks. Nature, 393, 440-442.


23

NETWORK VISUALIZATIONS

OBJECTIVES

• To introduce some network visualization techniques.
• To discuss the suitability of different network visualizations for different purposes.

INTRODUCTION

Visualization is a powerful method for conveying complex information in a clear form. From a picture of a network it will normally be easier to identify key properties than through a written description or a list of its connections. There are many techniques for producing visualizations of networks and many programs available to apply the techniques, including some excellent free ones. In fact there are so many options available that the choice itself has become a problem. In this chapter some broad classes of visualization technique are introduced. See chapter 15 for some examples of academic networks.

NETWORK DIAGRAMS

The simplest kind of network to visualize is one with few nodes, say less than 20, and at most one link between each pair of nodes. This could be a group of web pages, or a group of web sites with multiple links between sites ignored. This kind of small network can be visualized through a simple network diagram, with the main problems being arranging the nodes so that the links overlap as little as possible, and labeling the nodes to give a meaningful diagram. See Figure 23.1 for an example. It was produced with Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/), but could have been drawn with any graphics program.

If there can be more than one link between nodes, then a standard way to represent this is by allowing the thickness of the arrow to represent the number of links, as in Figure 23.2 (cf., Thelwall, 2001; Thelwall & Smith, 2002). A threshold can be set to avoid drawing thin arrows. For example, a threshold may mean that arrows with a thickness of less than 10% of the thickest are not drawn.
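A minimal sketch of this convention, using networkx and matplotlib with invented inter-site link counts (both the library choice and the data are assumptions of the illustration), is given below; arrows thinner than 10% of the thickest are simply not added to the drawing.

import networkx as nx
import matplotlib.pyplot as plt

counts = {('A', 'B'): 120, ('B', 'C'): 40, ('A', 'C'): 8}   # links between pairs of sites
heaviest = max(counts.values())
G = nx.DiGraph()
for (source, target), weight in counts.items():
    if weight >= 0.1 * heaviest:                             # threshold: drop the thinnest arrows
        G.add_edge(source, target, weight=weight)
pos = nx.circular_layout(G)
widths = [5 * data['weight'] / heaviest for _, _, data in G.edges(data=True)]
nx.draw_networkx(G, pos, width=widths, node_color='lightgrey')
plt.axis('off')
plt.show()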


Figure 23.2. A network diagram with arrow thickness proportional to link counts (Thelwall, 2001).


If there are many nodes, say more than 20, or the point of a diagram is to make an objective representation of the network to test or illustrate a point about its structure, then a network-drawing algorithm can be employed. There are several different such algorithms, all based upon applying heuristics to position the nodes so that interlinked pairs of nodes tend to be close together whereas non-interlinked nodes tend to be far apart. This often results in heavily interlinked nodes clustering together at the center of a diagram and nodes with few links being near its edge. An algorithm may take into account link frequency and try to put pairs of nodes with many links between them closer together than those with only a few. An example is the Kamada and Kawai (1989) algorithm, as implemented in the Pajek software.
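The same layout heuristic is available outside Pajek; the sketch below (networkx and matplotlib on a hypothetical network, illustration only) positions and draws a small network with a Kamada-Kawai style layout.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph([('A', 'B'), ('B', 'C'), ('C', 'A'), ('C', 'D'), ('D', 'E')])
pos = nx.kamada_kawai_layout(G)          # heuristic placement: linked nodes end up close together
nx.draw_networkx(G, pos, node_color='lightgrey')
plt.axis('off')
plt.savefig('network.png')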

Node labeling is a problem with large networks due to lack of space in which to fit the labels. One solution offered by some software is to color-code the nodes so that nodes of the same type have the same color. This could be useful if the purpose of the diagram is to highlight the differences between a few node types and the names of individual nodes are not relevant.

LARGE NETWORK DIAGRAMS

A problem of large network diagrams is that with too many links and nodes, information will be inevitably lost as the nodes and links significantly overlap and obscure each other. If this happens, the only choice may be to move to a visualization technique that does not draw the links, a dimension reduction method, as discussed in the next section. An intermediate stage is to employ an algorithm that attempts to remove the least important links, producing a simplified diagram that hopefully retains the essence of the structure of the original. Pathfinder network scaling is an algorithm for this (Chen, 1999). An extreme variant produces a network with a minimal number of connections, called a maximal spanning tree. It is particularly suitable for networks where a significant proportion of nodes interlink and the number of links between pairs of nodes is significant (cf., Chen, Newman, Newman & Rada, 1998).
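The extreme case is easy to compute directly; the sketch below (networkx, invented weighted network) keeps only a maximal spanning tree of a link count network, as an illustration rather than an implementation of full pathfinder network scaling.

import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([('A', 'B', 12), ('B', 'C', 3), ('A', 'C', 7), ('C', 'D', 5)])
tree = nx.maximum_spanning_tree(G)       # keeps the heaviest links that still connect every node
print(sorted(tree.edges(data='weight')))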

If the number of links between pairs of nodes in a network is not significant, then pathfinder network scaling is not appropriate, but there are methods from graph theory that can be used. For example, a network diagram could be systematically reduced to its best connected nodes by removing the least connected nodes, an option supported in Pajek. See chapter 15 for an example of this technique.

MULTIDIMENSIONAL SCALING

Multidimensional scaling (MDS) is a statistical 'dimension reduction' technique that can be used to create two-dimensional pictures or three-dimensional visualizations from complex data (Borg & Groenen, 1999). For networks, MDS can be used to generate link-free graphs by plotting the nodes (i.e., sites or pages) in such a way that nodes which heavily interlink tend to be close together. MDS is offered by statistical programs as well as by some visualization software; it is a common technique. Despite its mathematical pedigree, MDS is still a heuristic and its visualizations represent an attempt to create a meaningful picture, but without any guarantee of accuracy. The stress values reported by some of the software can be helpful in assessing how well the data fits a two-dimensional model. High stress values suggest that the MDS picture contains a significant distortion of the network.

Page 234: Link Analysis: An Information Science Approach (Library and Information Science) (Library and Information Science)

222 Link Analysis: An Information Science Approach

An important issue with multidimensional scaling is how best to present the link data to the algorithm. A measure of the similarity or dissimilarity of nodes is needed. The logical choice is to use link counts as a similarity metric, perhaps normalized (e.g., Musgrove, Binns, Page-Kennedy et al., 2004), but other options can sometimes be better. The same type of problem has occurred in another research area, author co-citation analysis (White & McCain, 1998), and correlation coefficients were suggested as a solution. Translated to the web, this would mean that the similarity of two nodes would be measured by taking the correlation coefficient of the link count profiles of each of the two nodes with all other nodes. Pairs of nodes that tend to interlink with the same other nodes, and with the same proportion of their links, would therefore have a high similarity measure. Using this correlation calculation allows nodes that are similar, but different in degree (i.e. the total number of links attached to them) to be close together in the MDS diagram.
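The sketch below illustrates this correlation-based approach with scikit-learn's MDS on an invented link count matrix (the library choice and the data are assumptions of the illustration).

import numpy as np
from sklearn.manifold import MDS

links = np.array([[0, 5, 2, 0],
                  [4, 0, 1, 0],
                  [3, 2, 0, 1],
                  [0, 0, 1, 0]], dtype=float)     # links[i, j] = links from site i to site j
similarity = np.corrcoef(links)                    # correlation of the sites' link count profiles
dissimilarity = 1 - similarity
coords = MDS(n_components=2, dissimilarity='precomputed',
             random_state=0).fit_transform(dissimilarity)
print(coords)                                      # one (x, y) position per site, ready for plotting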

Figure 23.3 illustrates an MDS diagram of web sites, with some regional clustering marked. The abbreviations in the figure all refer to different UK universities. Note that there are no links drawn: university web sites tend to be close together if they have similar inlink profiles.

Figure 23.3. An MDS map of UK university web sites (Thelwall, 2002).

SELF-ORGANIZING MAPS

Self-organizing maps are a technique used to plot large quantities of data in an intuitive way (Kohonen, Kaski & Lagus, 2000). They essentially cluster similar documents together into a two-dimensional map with areas that can be labeled by topic or subject. Self-organizing maps have been applied to web data, using links between pages to produce a measure of similarity (Faba-Perez, Guerrero-Bote & De Moya-Anegon, 2004). Other researchers have used them to cluster documents by content similarity rather than link similarity (e.g., Kohonen, Kaski & Lagus, 2000).

KNOWLEDGE DOMAIN VISUALIZATION

Knowledge domain visualization (KDViz) is a relatively new research area, a subfield of the more established information visualization. It is a multidisciplinary field that researches the production of visualizations for data relating to specific knowledge domains (Borner, Chen & Boyack, 2003; Chen, 2003). A common type of data is citations for a research area or journal. Its visualizations range from two-dimensional and static to three-dimensional and interactive. Interactive visualizations are not suitable for print media, such as the typical journal article, but may be suitable for e-journal articles or to allow end users to directly explore the data. This is a good field to check for new theories and tools that could be useful for the production of web network visualizations.

SUMMARY

A range of different network visualization techniques has been described. The choice of technique for any given data set depends upon several things.

• The number of nodes in the data set. Large networks cannot use simple network diagrams.
• Whether there can be multiple links between nodes. Some techniques only work for networks in which multiple links are allowed. In the web, multiple links can occur if the nodes are web sites and there are multiple links between pairs of web sites.
• The medium in which the visualization will be created. There is more choice if the visualization does not have to be static, and if it does not have to be small enough to fit on a single book or journal page.

• The research question.

ONLINE RESOURCES

The online component of this chapter contains links to network visualization software and some instructions for using it in link analysis http://linkanalysis.wlv.ac.uk/23.htm. These are listed in Table 23.1, and the Pajek web site and software are illustrated in figures 23.4 and 23.5.

REFERENCES

Borg, I. & Groenen, P. (1999). Modern multidimensional scaling: Theory and applications. New York: Springer Verlag.
Borner, K., Chen, C. & Boyack, K. (2003). Visualizing knowledge domains. Annual Review of Information Science & Technology, 37, 179-255.
Chen, C., Newman, J., Newman, R. & Rada, R. (1998). How did university departments interweave the Web: A study of connectivity and underlying factors. Interacting with Computers, 10(4), 353-373.
Chen, C. (1999). Visualising semantic spaces and author co-citation networks in digital libraries. Information Processing and Management, 35(3), 401-420.
Chen, C. (2003). Mapping scientific frontiers: The quest for knowledge visualization. New York: Springer Verlag.
Faba-Perez, C., Guerrero-Bote, V.P. & De Moya-Anegon, F. (2004). Methods for analyzing web citations: A study of web-coupling in a closed environment. LIBRI, 54(1), 43-53.
Kamada, T. & Kawai, S. (1989). An algorithm for drawing general undirected graphs. Information Processing Letters, 31(1), 7-15.
Kohonen, T., Kaski, S. & Lagus, K. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3), 574-585.
Musgrove, P., Binns, R., Page-Kennedy, T. & Thelwall, M. (2004). A method for identifying clusters in sets of interlinking web spaces. Scientometrics, 58(3), 657-672.
Thelwall, M. & Smith, A.G. (2002). A study of the interlinking between Asia-Pacific university web sites. Scientometrics, 55(3), 363-376.
Thelwall, M. (2001). Exploring the link structure of the web with network diagrams. Journal of Information Science, 27(6), 393-402.
Thelwall, M. (2002). An initial exploration of the link relationship between UK university web sites. ASLIB Proceedings, 54(2), 118-126.
White, H.D. & McCain, K.W. (1998). Visualizing a discipline: An author co-citation analysis of information science, 1972-1995. Journal of the American Society for Information Science, 49(4), 327-355.

Table 23.1. Information available in the online component of this book.

Information/Use: Network visualization tools (online). For instant web site visualizations.
Examples: TouchGraph.com http://www.touchgraph.com/

Information/Use: Network visualization tools (offline) - software that must be installed before use. To produce visualizations of link data from sets of web sites. The visualizations could be of links between web sites or of links within web sites.
Examples: Network visualization software Pajek http://vlado.fmf.uni-lj.si/pub/networks/pajek/ and instructions for creating Pajek networks from SocSciBot data; Graphviz http://www.research.att.com/sw/tools/graphviz/

Information/Use: Online web visualizations. Web sites with web visualizations.
Examples: Self-Organizing Maps http://websom.hut.fi/websom/ and Cybergeography http://www.cybergeography.org/


Figure 23.4. The Pajek web site (http://vlado.fmf.uni-lj.si/pub/networks/pajek/).


Figure 23.5. Pajek with a small network, about to apply the Kamada-Kawai algorithm.


24

ACADEMIC LINK INDICATORS

OBJECTIVE

• To describe a range of indicators that may be constructed from collections of university web sites.

INTRODUCTION

The knowledge that has been built in part III of this book and illustrated to some extent in the Spanish case study chapter and the academic networks chapter can be harnessed to build useful academic web indicators. For the purposes of this chapter, an indicator is a number, a table of numbers, or a visual representation of quantitative information. Indicators are widely used by government and industry to help monitor processes that they need to control, or which have important outcomes that they need to be aware of. It is logical to design indicators for web publishing because it is an important aspect of research and education. Many countries and regions produce regular statistics concerning research, development and innovation. Examples include the US National Science Foundation (NSF) annual Science and Engineering Indicators (www.nsf.gov/sbe/srs/), and the Science and Technology Indicators for the European Research Area (STI-ERA) reports published by the European Union (europa.eu.int/comm/research/era/sti_en.html). The goal of the European indicators can be seen from the following statement on the home page of the responsible unit.

The mission of the unit "Competitiveness, economic analysis, indicators" in [the Research Directorate General] is to identify relevant data on science and technology, to convert them into meaningful indicators on scientific and technological performances and developments and, on this basis, to provide policy-relevant economic analyses for the European Research Area.

(STI-ERA, 2004)

The creation of a web indicators chapter in the STI-ERA reports was a goal of the EU funded WISER project.


In this chapter, after a brief theoretical introduction to indicator theory, a range of different types of web indicator will be discussed, building upon section III and the chapter 13 case study.

WEB INDICATORS AS PROCESS INDICATORS

In industry, there are typically three different types of indicator. These could be applied on a small scale to machines or processes, or on a large scale to national economies (Geisler, 2000).

Input indicators give information about the inputs to a process of interest. Examples include raw materials, labor time and costs.

Output indicators give information about the outputs of a process. Examples include production volume and profits.

Process indicators give information about the operation of an ongoing process. This information would typically be used to adjust the process in order to improve its operating efficiency or to avoid undesired outputs. Examples include patent applications and employee sickness.

The creation of university web pages is not an end in itself, when viewed from a large-scale national or international perspective. Universities serve to educate students, produce research and engage in technology transfer with industry, amongst other things. The web helps universities to carry out these key goals (<chapter 7; Middleton, McConnell & Davidson, 1999). Teaching pages and student information pages are a natural part of education. Research-related pages, including home pages with the primary purpose of publicity generation, are now a natural part of research; researchers, departments and universities that fail to have an effective web presence risk losing opportunities to maximize the impact of their research. Statistics about university web sites are, therefore, process indicators (Scharnhorst, 2004). These can be used to provide information to aid policy makers and managers to ensure that web sites are being used effectively as part of the processes of education and research. Web indicators can also be process indicators in a different sense. Some web-based statistics can reveal underlying patterns in scholarly communication for those interested in the process of research. For instance, the linguistic, geographic and geo-political linking trends discussed in chapters 8 to 10 may reflect underlying patterns of scholarly communication.

ISSUES OF SIZE AND RELIABILITY

Scale is an important factor in the reliability of statistics. This is recognized in citation analysis. For example, citations should not be used to compare individual researchers but have some value for comparing whole departments within a discipline (van Raan, 2000). Individual researchers' citations could be greatly influenced by luck, by the type of publication they write (e.g., review articles tend to attract many citations; Borgman & Furner, 2002) and by the size and age of their specialism.


The same factors may also influence the citation counts of all the individual researchers within a single department, but the total citation count for the department should tend to average out to some extent the influences unrelated to research impact. Web links are less reliable as a data source than citations for many reasons, including the following.

• They are not a core part of research publishing.
• Individual academics may choose not to publish online, even if they are highly successful researchers.
• Many links seem to be created for relatively minor reasons.
• There is no necessity to link to any other page, no matter how closely connected and informative the potential target page.
• Individuals or small teams may run funded projects that have the creation of useful websites as the end product, rather than journal articles or scientific innovations, say. Such projects are likely to attract many inlinks because of their nature, an amplified version of the citation-attracting power of review articles.

The last point is perhaps the most important, especially given the power laws for linking found on the web (<chapter 5). It is not reasonable to compare the link attractiveness of two sites if one has been created as publicity (i.e. part of the process of research), whereas the other is a web portal (i.e. a research output). One way of assessing the reliability of link count data at different scales is to use correlation tests with a research-related data source. The same approach is also used for citations. A strong correlation with research is an indication that the level of aggregation is high enough to give useful information. As reported in chapters 8 and 10, strong correlations have been found between research ratings and inlinks for universities and some departments, but not for all subjects. Since the scale of web publishing varies greatly from country to country, it is not possible to give a general rule for appropriate scales, but the suggestions below are consistent with, and an extrapolation from, what is known in 2004.

• Heavy web publishing countries (e.g. USA, Japan, Australia, New Zealand, Canada, Western Europe, China, Taiwan, South Korea). Analyses of universities, computer science departments, hard science departments, maths departments, and some social science departments are likely to be successful.
• Medium web publishing countries (e.g. South America, North Africa, Eastern Europe, Zimbabwe). Analyses of universities and computer science departments are likely to be successful.
• Low web publishing countries. No analyses are likely to be successful, although international linking comparisons between groups of low web publishing countries may be successful.

This suggested list would presumably change over time as all countries build significant academic web presences.
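
To make the correlation test mentioned above concrete, the following minimal sketch (Python, with scipy assumed to be available; all figures and university names are invented) checks whether site inlink counts, normalized by faculty numbers, correlate with research ratings across a group of universities. Spearman's rho is used because link count data tend to be highly skewed.

    # A minimal sketch of the correlation test used to judge whether link counts
    # are aggregated at a high enough level to be informative. All figures below
    # are invented for illustration only.
    from scipy.stats import spearmanr

    universities = {
        # name: (site inlinks, faculty members, research rating)
        "Univ A": (12500, 900, 5.1),
        "Univ B": (4300, 450, 4.2),
        "Univ C": (20100, 1500, 5.6),
        "Univ D": (800, 300, 3.1),
        "Univ E": (6700, 700, 4.8),
    }

    inlinks_per_faculty = [i / f for i, f, _ in universities.values()]
    research_ratings = [r for _, _, r in universities.values()]

    rho, p_value = spearmanr(inlinks_per_faculty, research_ratings)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
    # A strong, significant positive correlation suggests that the chosen level
    # of aggregation (here, whole universities) gives potentially useful information.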


BENCHMARKING INDICATORS

This section is concerned with indicators that are tables of data for sets of departments, universities or countries. As discussed above, there must be a sufficiently large scale of web publishing to make the indicators meaningful. Table 24.1 illustrates the range of choices of indicator available. Normally, the units and coverage will be determined by the purpose of the study, but the indicator builders may have a choice of whether to select all of the document model and object alternatives, or just to report a selection.

Table 24.1. Possible components of benchmarking indicators.

Units           Coverage             Document model   Objects counted
Universities    Single country       Page             Pages per faculty member
Departments     Several countries    Directory        Inlinks per faculty member
                                     Domain           Outlinks per faculty member
                                     University

Tables of aggregated country data can be used for international comparisons: to assess how each country's web profile compares to those of other countries. For tables of universities or departments in a single country, the mathematical linking models can be constructed from the tables and research data. These models can then be used to calculate benchmark figures for each university. Comparisons can then be made between actual values and benchmark values in order to identify universities or departments that are apparently under- or over-performing on the web. Failure to match the benchmark may be due to non-web factors, such as a university specializing in low web-using subjects, and so the benchmark values serve the purpose of flagging potential causes for concern, which must be further investigated. The investigations may have two outcomes: exoneration, where the low figure is not a cause of concern; and conviction, where the low figure reveals a genuine problem. This type of indicator has been termed 'weak benchmarking' (Thelwall, 2004) because the figures calculated are not binding.
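
As a rough illustration of weak benchmarking, the sketch below (pure Python; the counts, names and the 50% threshold are invented) assumes the simplest possible linking model, with site inlinks roughly proportional to research productivity, and flags universities that fall well below their benchmark values as candidates for further investigation rather than as proven problems.

    # A minimal weak benchmarking sketch, assuming that inlinks are roughly
    # proportional to research productivity. All figures are invented.
    universities = {
        # name: (site inlinks, faculty members, research rating)
        "Univ A": (12500, 900, 5.1),
        "Univ B": (1200, 450, 4.2),
        "Univ C": (20100, 1500, 5.6),
        "Univ D": (800, 300, 3.1),
    }

    # Research productivity = faculty numbers x research rating.
    productivity = {u: f * r for u, (_, f, r) in universities.items()}

    # Estimate a single proportionality constant from the whole group.
    k = sum(i for i, _, _ in universities.values()) / sum(productivity.values())

    for name, (inlinks, _, _) in universities.items():
        benchmark = k * productivity[name]
        ratio = inlinks / benchmark
        note = "  <-- investigate further" if ratio < 0.5 else ""
        print(f"{name}: actual {inlinks}, benchmark {benchmark:.0f}, ratio {ratio:.2f}{note}")
    # A low ratio is only a flag: the outcome of a follow-up investigation may be
    # exoneration (no real problem) or conviction (a genuine cause for concern).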

A comprehensive indicator report for a single country may consist of the three types of indicator above calculated for whole universities and for the departments of each subject that publishes on a large enough scale. The report should also include discussions of outliers identified by the benchmarking, and the identification of causes for concern. It may also give international comparisons with comparable countries, through the same three types of indicator.

LINK METRICS

Peter Ingwersen's (1998) Web Impact Factors are the original link metrics. Table 24.2 gives details of three web metrics that are now in use (cf. Li, 2003). There has been a shift from counting average inlinks per page for a site (the original Web Impact Factor) to counting average inlinks per faculty member (e.g. Thelwall, 2001). This is because crawler coverage issues mean that counting the number of pages on a site is impossible, and also because site design decisions can have a big impact upon page counts (e.g. a dictionary could be placed online with one page per definition, giving an enormous site).


Nevertheless, for situations where faculty numbers are not available, page count denominators are still used.

The measures in Table 24.2 can be applied to individual web sites or even to larger areas such as entire countries. For example, they can be calculated for national academic webs: all of the university web sites from a single country. They can also be calculated with all of the different ADMs.

The Web Impact Factor measures in some sense the link attractiveness of a site, averaged by page numbers, faculty numbers or research productivity. The model of academic linking (<chapter 8) suggests that averaging by page numbers or research productivity should give figures that are approximately constant for the universities in a single country. This means that they can be used as weak benchmarking indicators. The same applies for Web Use Factors, which measure the average amount of outlinking from a site. Following the chapter 8 link model again, averaging by faculty numbers would give results that correlate with average research productivity per faculty member. These would be problematic to use as weak benchmarking indicators in countries that have a variety of research orientations in their universities. In fact, this kind of calculation has been suggested for use as a very approximate estimator of university average research quality (Thelwall, Binns, Harries, et al., 2001). But if research productivity figures are difficult to obtain for a country and there is reason to believe that its universities are similar in research capability, then faculty numbers would be a possible WIF or WUF denominator.

Table 24.2. Common academic web metrics.

Web Impact Factor
  Applies to: a web site, or other area of the web.
  Formula: Inlinks to site / Web pages in site, or Inlinks to site / Faculty numbers, or Inlinks to site / Research productivity.
  Comments: Inlinks can come from any specified area of the web.

Web Use Factor
  Applies to: a web site, or other area of the web.
  Formula: Outlinks from site / Web pages in site, or Outlinks from site / Faculty numbers, or Outlinks from site / Research productivity.
  Comments: Outlinks can target any specified area of the web.

Link propensity
  Applies to: two web sites A and B, or two other areas of the web.
  Formula: Links from site A to site B / (Pages in site A × Pages in site B), or Links from site A to site B / (Faculty in A × Faculty in B), or Links from site A to site B / (Research prod. A × Research prod. B).


WIFs and WUFs can be calculated relative to different areas of the web. For instance, an .edu WIF would be calculated from link pages in the .edu domain. Restricting the calculation to different areas of the web allows comparisons to be made between different types of impact, such as educational, national, international and commercial (Thelwall, 2002). The same is also true for the WUF. Recall, however, that care must be taken when interpreting the results of calculations using top level domains (TLDs) such as .edu, .com and .uk, because domain names should not be taken at face value. For example, although the .com domain name is supposed to be for commercial use, any person can buy a .com domain name and in practice it contains an enormous variety of different types of information. Even the .edu domain is not restricted to American universities (e.g. www.london.edu).
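
The ratios in Table 24.2 are straightforward to calculate once the raw counts have been collected. The minimal sketch below (pure Python; all counts are invented, and the .edu restriction is just one possible choice of source area) computes a Web Impact Factor with three alternative denominators, a Web Use Factor, and a WIF restricted to inlinks from one top level domain.

    # A minimal sketch of WIF and WUF calculations with alternative denominators.
    # All counts are invented; in practice they would come from a search engine
    # or a research crawler, with the caveats discussed in earlier chapters.
    site = {
        "inlinks": 15800,                 # link pages pointing at the site
        "inlinks_from_edu": 5200,         # inlinks whose source pages are in .edu
        "outlinks": 22500,                # links from the site to other sites
        "pages": 41000,                   # publicly indexable pages in the site
        "faculty": 1200,                  # faculty members
        "research_productivity": 5400.0,  # e.g. faculty numbers x research rating
    }

    def ratio(numerator, denominator):
        """Generic ratio used for all of the WIF and WUF variants."""
        return numerator / denominator

    print(f"WIF (per page):           {ratio(site['inlinks'], site['pages']):.3f}")
    print(f"WIF (per faculty member): {ratio(site['inlinks'], site['faculty']):.2f}")
    print(f"WIF (per research prod.): {ratio(site['inlinks'], site['research_productivity']):.2f}")
    print(f"WUF (per faculty member): {ratio(site['outlinks'], site['faculty']):.2f}")
    print(f".edu WIF (per faculty):   {ratio(site['inlinks_from_edu'], site['faculty']):.2f}")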

RELATIONAL INDICATORS

Relational indicators will typically be diagrams illustrating link-based relationships between countries, universities or departments. Multivariate statistics such as cluster analysis (Gordon, 1999), multidimensional scaling (Kruskal & Wish, 1978) and factor analysis (Tabachnick & Fidell, 2001) may also be used for relationship identification. Network drawing techniques are described in chapter 23 and some network indicators can be found in chapters 9 and 13.

Relational indicators can be used to highlight geographic, linguistic or geo-political trends. The data used is summarized in Table 24.3, where normalized interlinks refer to link counts divided by source and target unit research productivity (faculty × research rating), also known as normalized propensity to link (Smith & Thelwall, 2002). Raw interlink counts are useful to gain an initial overall impression of link flows, whereas normalized interlink counts can reveal trends underlying the links. Relational indicators work best if the entities compared are similar in size: great dissimilarity can lead to meaningless comparisons because linking can be scale-dependent to some extent (Thelwall & Smith, 2002).

Table 24.3. Possible components of relational indicators.

Units           Coverage             Document model   Objects counted
Universities    Single country       Page             Interlinks
Departments     Several countries    Directory        Normalized interlinks
                                     Domain
                                     University
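
A minimal sketch of how the raw and normalized interlink counts in Table 24.3 could be produced is given below (pure Python; the link counts and research productivity figures are invented).

    # A minimal sketch of raw and normalized interlink counts between universities.
    # All counts and productivity figures are invented for illustration.
    interlinks = {
        # (source, target): raw count of links between the two university web sites
        ("Univ A", "Univ B"): 340,
        ("Univ A", "Univ C"): 95,
        ("Univ B", "Univ C"): 410,
    }

    # Research productivity = faculty numbers x research rating for each university.
    research_productivity = {"Univ A": 4600.0, "Univ B": 1900.0, "Univ C": 8400.0}

    for (source, target), raw in interlinks.items():
        normalized = raw / (research_productivity[source] * research_productivity[target])
        print(f"{source} -> {target}: raw {raw}, normalized {normalized:.2e}")
    # Raw counts give an overall impression of link flows; normalized counts help
    # to reveal trends that are not simply a consequence of institution size.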

OTHER METRICS

One other link indicator has been used in addition to those in Table 24.2, the Web Connectivity Factor (WCF), which is a potential measure of how well interconnected a web site is (Thelwall, 2003). For two web sites A and B, the link connectivity measure is the number of links from A to B or the number of links from B to A, whichever is the smaller. For a web site A, the WCF is the total of its link connectivities with all other web sites B, divided by the page count of A, the research productivity of A, or the number of faculty members in A. This measure is potentially more reliable than the WIF and WUF since unidirectional anomalies should not have an impact.


If there is a high number of links from site A to site B then the connectivity between A and B will be the number of links from site B to site A, assuming that this is lower and a 'normal' value.
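
A minimal sketch of the WCF calculation is given below (pure Python; the directed link counts and faculty numbers are invented): for each pair of sites the smaller of the two directed counts is taken, the pairwise values are summed for each site, and the total is divided by a chosen denominator, here faculty numbers.

    # A minimal sketch of the Web Connectivity Factor (WCF). All counts invented.
    links = {
        # (source, target): directed link count between university web sites
        ("A", "B"): 500, ("B", "A"): 40,
        ("A", "C"): 25,  ("C", "A"): 30,
        ("B", "C"): 60,  ("C", "B"): 55,
    }
    faculty = {"A": 900, "B": 450, "C": 1200}

    def connectivity(x, y):
        """Pairwise connectivity: the smaller of the two directed link counts."""
        return min(links.get((x, y), 0), links.get((y, x), 0))

    for site in faculty:
        total = sum(connectivity(site, other) for other in faculty if other != site)
        wcf = total / faculty[site]
        print(f"Site {site}: connectivity total {total}, WCF per faculty {wcf:.3f}")
    # The one-sided anomaly of 500 links from A to B contributes only 40 (the
    # smaller count) to A's total, so it has little effect on the WCF.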

A set of academic web indicators would not be complete if it concentrated only upon link indicators. The following are suggestions for other types of indicators that are appropriate for a report on academic web presence.

• Web site sizes (pages, directories, domains).
• Coverage of the web site in the major search engines.
• Web site usability measures (Nielsen, 2000; www.useit.com).
• Information content measures (e.g. provision of certain types of information such as full online prospectuses, full details of staff publications).

There have been several academic-related web investigations that involve different types of indicators, which give ideas about the potential for expanding academic link analysis to a wider environment. For example, Cui (1999) used inlinks and Hernandez-Borges, Macias-Cervi, Gaspar-Guardado et al. (1999) used a range of statistics, including inlinks, as potential quality indicators for medical web sites.

SUMMARY

Once basic link data has been gathered from the university web sites of a country or of several countries, there is a wide range of different types of benchmarking and relational indicator that can be calculated. These may include university-level indicators and departmental-level indicators, in addition to international comparisons. These are all process indicators in the sense of providing information that may be used to take corrective action in research policy or web site management. They may also be used to support a theoretical analysis of scholarly communication, online linking or web publishing. Outcomes may include the identification of individual underperforming nations, universities or departments, or the identification of (possibly undesired) online connectivity patterns, such as linguistic commonalities.

FURTHER READING

Aguillo (1998) gives a good introduction to the potential for web-based academic indicators in his farsighted early article. Two Spanish-language cybermetrics books are also relevant for indicator building (Alonso Berrocal, Figueroa & Zazo, 2004; Faba Perez, Guerrero Bote, & de Moya Anegon, 2004). Li (2003) has published a review of issues relating to Ingwersen's Web Impact Factor. Examples of web indicators can be seen in the outputs of various projects including EICSTES (European Indicators, Cyberspace and the Science-Technology-Economy System, www.eicstes.org, 2000-2003), WISER (Web Indicators for Science, Technology & Innovation Research, www.wiserweb.org, 2002-2005), SIBIS (Statistical Indicators Benchmarking the Information Society, www.sibis-eu.org, 2001-2003) and OCLC (Online Computer Library Center, www.oclc.org, 1967-) in the USA.


REFERENCES

Aguillo, I. F. (1998). STM information on the Web and the development of new Internet R&D databases and indicators. Online Information 98: Proceedings, 239-243.

Alonso Berrocal, J., Figueroa, C. & Zazo, A. (2004). Cibermetría: nuevas técnicas de estudio aplicables al web. Gijón, Spain: Trea.

Borgman, C. & Furner, J. (2002). Scholarly communication and bibliometrics. In: Cronin, B. (ed.), Annual Review of Information Science and Technology 36, Medford, NJ: Information Today Inc., pp. 3-72.

Cui, L. (1999). Rating health web sites using the principles of citation analysis: a bibliometric approach. Journal of Medical Internet Research, 1(1), e4. Available: http://www.jmir.org/1999/1/e4/index.htm

Faba Perez, C., Guerrero Bote, V. P. & de Moya Anegon, F. (2004). Fundamentos y tecnicas cibermetricas. Available: http://www.juntaex.es/consejerias/ect/dgsi/Documentacion/tecnicascibermetricas.pdf

Geisler, E. (2000). The metrics of science and technology. London, UK: Quorum Books.

Gordon, A.D. (1999). Classification, 2nd Ed. London: Chapman and Hall.

Hernandez-Borges, A.A., Macias-Cervi, P., Gaspar-Guardado, M.A., Torres-Alvarez de Arcaya, M.L., Ruiz-Rabaza, A. & Jimenez-Sosa, A. (1999). Can examination of WWW usage statistics and other indirect quality indicators help to distinguish the relative quality of medical web sites? Journal of Medical Internet Research, 1(1), e1. Available: http://www.jmir.org/1999/1/e1/index.htm

Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54, 236-243.

Kruskal, J.B. & Wish, M. (1978). Multidimensional Scaling. Beverly Hills: Sage.

Li, X. (2003). A review of the development and application of the Web Impact Factor. Online Information Review, 27(6), 407-417.

Middleton, I., McConnell, M. & Davidson, G. (1999). Presenting a model for the structure and content of a university World Wide Web site. Journal of Information Science, 25(3), 219-227. Available: http://www.abdn.ac.uk/~com134/publications/jis1999.shtml

Nielsen, J. (2000). Designing web usability. Indianapolis, IN: New Riders.

Scharnhorst, A. (2004). Personal communication.

Smith, A.G. & Thelwall, M. (2002). Web Impact Factors for Australasian universities. Scientometrics, 54(3), 363-380.

STI-ERA (2004). European Research Area - Science and Technology Indicators. Accessed June 3, 2004. Available: http://europa.eu.int/comm/research/era/sti_en.html

Tabachnick, B. & Fidell, L.S. (2001). Using multivariate statistics. Boston: Allyn and Bacon.

Thelwall, M., Binns, R., Harries, G., Page-Kennedy, T., Price, E. & Wilkinson, D. (2001). Custom interfaces for advanced queries in search engines. ASLIB Proceedings, 53(10), 413-422.

Thelwall, M. & Smith, A.G. (2002). A study of the interlinking between Asia-Pacific university Web sites. Scientometrics, 55(3), 363-376.

Thelwall, M. (2001). Extracting macroscopic information from web links. Journal of the American Society for Information Science and Technology, 52(13), 1157-1168.

Thelwall, M. (2002). A comparison of sources of links for academic Web Impact Factor calculations. Journal of Documentation, 58(1), 60-72.


Thelwall, M. (2003). Web use and peer interconnectivity metrics for academic web sites. Journal of Information Science, 29(1), 11-20.

Thelwall, M. (2004). Weak benchmarking indicators for formative and semi-evaluative assessment of research. Research Evaluation, 13(1), 63-68.

van Raan, A.F.J. (2000). The Pandora's box of citation analysis: Measuring scientific excellence - the last evil? In: Cronin, B. and Atkins, H.B. (Eds.). The web of knowledge: a festschrift in honor of Eugene Garfield. Medford, NJ: Information Today Inc. ASIS Monograph Series, 301-319.


VI SUMMARY

25

SUMMARY

OBJECTIVES

• To summarize the most important points in the book.
• To contrast the link analysis approach described in this book with other approaches.
• To discuss the future of link analysis.

INTRODUCTION

This book attempts to describe an information science approach for link analysis in a more coherent and complete way than has been possible in previously published articles. If it enables some people who are new to link analysis to employ its approach, then the book will have been successful. In addition to the general description of information science link analysis (<chapters 1 to 4), the background information about web structures (<chapters 5 and 6) and the detailed analysis of academic web spaces (<chapters 7 to 11) aim to equip researchers with the intuition to devise new experiments and the skills to analyze their results. The case studies in part IV may be of use to different audiences, for which some of them may be more relevant than others. Finally, the tools and techniques sections, in conjunction with the book's web site, software and data, are designed to help researchers with the practicalities of data collection and processing.

The information science approach to link analysis is characterized by a concern to validate link counts as a data source from which to draw justifiable conclusions. This is in contrast to typical computer science approaches for which validity is not an issue; instead the important factor is whether links can be used to improve the results of information retrieval, web mining or other algorithms (e.g., PageRank, HITS).


Below is the overview of the information science approach to link analysis, reiterated from chapter 1.

An information science approach to link analysis
1) Formulate an appropriate research question, taking into account existing knowledge of web structure (<chapters 5, 6, and chapters 7-16 as appropriate).
2) Conduct a pilot study (<chapter 4).
3) Identify web pages or sites that are appropriate to address a research question.
4) Collect link data from a commercial search engine or a personal crawler, taking appropriate safeguards to ensure that the results obtained are accurate (<chapter 17 or 18).
5) Apply data cleansing techniques to the links, if possible, and select an appropriate counting method (<chapters 3 and 19).
6) Partially validate the link count results through correlation tests (<chapter 4).
7) Partially validate the interpretation of the results through a link classification exercise (<chapter 4).
8) Report results with an interpretation consistent with the link classification exercise, including either a detailed description of the classification or exemplars to illustrate the categories (<chapter 4).
9) Report the limitations of the study and the parameters used in data collection and processing (stages 3 to 5) (<chapters 3, 4).

Despite the importance of the information science approach, most of the link analysis applications in part IV do not use it. With the exception of chapter 14, they do not tackle the issue of validity but take a less scholarly approach. In some cases this is because a full-scale validity study is inappropriate for the methods used (<chapters 12, 16) and in others because the purpose of the case study is to demonstrate some techniques (<chapters 13, 15). This variety reflects the fact that there are different useful approaches for extracting information from links, depending upon the objective of any given exercise.

INFORMATION SCIENCE CONTRIBUTIONS TO LINK ANALYSIS

Researchers in many different subject areas have already adopted versions of link analysis to address their own research questions. Information science is a discipline with a primary focus on information rather than any particular application of that information to solve a research question. For example, a sociologist may use link analysis as a tool to achieve the end of finding out more about cyberactivism. An information scientist may also use link analysis to investigate a topic, but with the objective of developing, testing or illustrating a methodology. She may also conduct wider studies in order to derive more general results about the overall level of link-related information in the web.


This can provide background information for use in subsequent research focused upon specific information uses rather than the information itself. Other information scientists investigate other aspects of information, including its use in different situations. Nevertheless, as the subject with an information focus, information science has the responsibility to develop and explore information-centered methodologies for use by researchers in other subjects, as well as to investigate the information itself in different settings.

The information science contribution to link analysis, then, is the development of methods for exploitation by others and the conducting of large scale studies that can provide background information for future users of the methods and analyzers of the information. For instance, the academic link analysis results reported in part III would provide useful context if, say, chemists wished to investigate linking practice within their discipline in Canada or sociologists wished to understand how links are used in female academics' personal home pages in Texan universities. Information scientists can and should conduct larger-scale studies than would be useful for other social science purposes because of this need to provide context information (e.g., linking models, link type distributions) for other subjects. Information science is in the middle of an information provision sandwich because computer science provides useful link-structure information from even larger-scale and more abstract investigations (<chapters 5, 6). Figure 25.1 summarizes the hypothesized link analysis information flow.

Figure 25.1. Information flow in link analysis.

OTHER LINK ANALYSIS APPROACHES

The link analysis methodology described in this book is not the only possible information science approach. Other information scientists use link analysis in different ways, and researchers outside of information science use link analysis methods that have an information science facet. Most of these are reviewed or mentioned in this book.

Other information science link analysis approaches


• Analysis of paths in the web and small world properties
• Longitudinal analysis of links, link permanence
• Analysis of links and their relationship to commercial search engine coverage and ranking

Computer science link analysis approaches
• Longitudinal analysis of links (e.g. in web dynamics)
• Link-content relationships
• Links and information retrieval (e.g. designing effective search engine algorithms)
• Web structure mining
• Web mining
• Web modeling

Social science and sociology link analysis approaches
• Web sphere analysis
• Virtual ethnography
• Hyperlink network analysis

FUTURE DIRECTIONS

Information science link analysis has now reached a degree of maturity, with a developed methodology and associated tools. An implication of this is that the nature of the field must change to reflect the new level of development. No future change is entirely predictable, but the following seem to be likely outcomes.

• Non-information scientists can now draw upon existing methodologies with which to conduct link analyses and can focus upon their subject-specific research questions rather than methodological development.
• Information scientists can share in this exploitation or can choose instead to refine or critically analyze existing methods, or turn to new ones.
• Link analysis can be transferred to the classroom, a move that has already taken place in many universities, within bibliometrics, link analysis or web courses. In this context, the advantages of link analysis for instructional purposes include its free data source, the web, and the almost infinite variety of potential student investigations.

Perhaps link analysis in one form or another will become an increasingly common skill for future graduates and researchers, to complement and support a sophisticated use of whichever search portal is dominant at the time.


26

GLOSSARY

• AllTheWeb. A commercial search engine, bought by Yahoo! in 2004, and previously used in a number of link analysis investigations because of its powerful link search facilities.
• AltaVista. A commercial search engine, bought by Yahoo! in 2004, very popular towards the end of the last century, and previously used in a number of link analysis investigations because of its powerful link search facilities.
• Alternative Document Model (ADM). A method of aggregating web content into units for counting purposes. See the directory ADM, domain ADM, site ADM and page ADM definitions.
• Citation. A reference by one publication of another. A citation is the reference viewed from the perspective of the referenced document.
• Co-linked. Two pages that both have inlinks from a third page are co-linked.
• Co-linking. Two pages that both have outlinks to a third page are co-linking. Sometimes also described as bibliographic coupling or just coupling.
• Cybermetrics. The application of quantitative techniques to the Internet, influenced by informetrics.
• Directory ADM. All the files in the same directory are treated as a single document. Directories are equated with the position of slashes in URLs, rather than by the actual directory/folder structure of pages on the hosting web server.
• Domain name. The part of an URL of a web page normally following the http:// and preceding the first subsequent slash (if any). Note that this is a simplified definition and there is a longer computer science definition that encompasses additional variations (Berners-Lee, Fielding & Masinter, 1998).
• Domain ADM. All files with the same domain name are treated as a single document.
• File Transfer Protocol (FTP).
• HITS. Hyperlink Induced Topic Search. An algorithm designed to use link structures to find the web pages most relevant to a given topic (<chapter 12).
• Host. Used to refer to an individual computer such as a web server.
• Hyperlink. A feature in a web page that allows users to click to navigate to a different web page. Hyperlinks are also called links and clickable links. They can also be found in hypertext environments other than the web.
• Hyperlink Network Analysis (HNA). Hyperlink network analysis is the application of social network analysis methods to the web.


• HyperText Markup Language (HTML). The coding language in which web pages are described. This is interpreted by web browsers to produce the web pages that web users see, and is processed by web crawlers to extract the embedded links.
• HyperText Transfer Protocol (HTTP). The mechanism used by programs such as web browsers and crawlers to communicate with a web server, for example to request a web page.
• Indicator. An indicator is a number, a table of numbers, or a visual representation of quantitative information. Note that wider definitions of indicators are sometimes used, encompassing the presentation of non-quantitative information.
• Inlink. A link to a web page. If qualified by a web unit, this implies that the link should originate outside of the specified unit. For example a site inlink is a link to any page in a site from any page in a different site. Similarly, a page inlink is a link to a page from a different page. Inlink is synonymous with 'backlink' and inlinked is synonymous with 'linked to'.
• Interlink. Normally a link between two different web sites, also referred to as an inter-site link. This is commonly used with the -ing form of the word. For example, web site interlinking refers to links between web sites (i.e., site inlinks/site outlinks).
• Internet. A large public network of computers running IP and able to communicate with each other.
• Internet Protocol (IP). The basic mechanism for transferring information over the Internet.
• IP address. A dot-separated list of numbers that identifies computers on the Internet, including web servers.
• Link page. A web page containing a link. This terminology is sometimes used instead of link because search engines count link pages rather than links in response to a link-based query.
• Outlink. A link from a web page. If qualified by a web unit, this implies that the link should target a page outside of the specified unit. For example, a site outlink is a link from any page in a site to any page in a different site. Similarly, a page outlink is a link from a page to a different page.
• Page ADM. Each separate file is treated as a document for extracting links, or for other counting purposes.
• PageRank. An algorithm used by Google to rank web pages using the link structure of the web (<chapter 12).
• Pajek. A program for network visualization.
• Path. A path between two nodes in a network is a contiguous chain of links, starting at the first and ending at the last.
• Portable Document Format (PDF). A document format created by Adobe and commonly used for posting documents on the web.
• Power law. A mathematical law that has been applied to many kinds of web data. It is related to rich-get-richer phenomena, and is also known as Lotka's law. See chapter 5 for a definition and discussion.
• Publicly indexable pages. The set of pages in a web site that can be found by a crawler by following links from the home page (and obeying ethical issues for crawling).
• Search engine. A program that allows users to type in an information request, such as a keyword query, and returns lists of web pages matching the query.


• Selflink. A link from a web page to the same page, perhaps to a different part of the page. If qualified by a web unit, this implies that the link should target a page inside of the specified unit. For example a site selflink is a link from any page in a site to any page in the same site. Site selflink is synonymous with 'internal site link', or sometimes just 'internal link'.
• Site ADM. All files belonging to a clearly defined web site are treated as a single document.
• Shortest path. A shortest path between two nodes is a path between them that has the minimum possible length.
• Social Network Analysis (SNA). Social network analysis is a methodology that has evolved to study social groupings, particularly in terms of social and communication connections within a group.
• SocSciBot. A web crawler available with this book and designed for research crawling.
• SocSciBot Tools. A suite of programs that can be used to analyze the link structure files produced by SocSciBot and those available in the cybermetrics university link structure databases.
• TLD spectral analysis. A technique for choosing an ADM to use for a data set (<chapter 19).
• Top level domain (TLD). The final segment of a domain name. This will either be a generic top level domain, such as .edu, .com and .info, or a country-specific domain, such as .uk for the UK or .es for Spain.
• UCINET. A program for social network analysis calculations.
• University ADM. All files belonging to a university are treated as a single document.
• Web. The collection of resources that can be obtained over the public Internet using HTTP.
• Web crawler, robot, bot. A program that visits web pages, automatically extracts their links and follows them.
• Web site. An entity without a single agreed definition. Loosely speaking, any collection of pages that have a consistent organizational, structural or visual theme may be thought of as a web site. Normally, web sites seem to have identifiable regularities in URLs, such as a common domain name, or a common directory on the web server.
• Webometrics. The application of quantitative techniques to the web, influenced by informetrics.

• Yahoo!. A commercial search engine and web directory service.

REFERENCES

Berners-Lee, T., Fielding, R. & Masinter, L. (1998). Uniform Resource Identifiers (URI): Generic Syntax and Semantics, RFC 2396, August 1998. Available: http://www.ietf.org/rfc/rfc2396.txt


APPENDIX

A SOCSCIBOT TUTORIAL

TUTORIAL

The objective of this tutorial is to give a walkthrough of the key capabilities of SocSciBot and its associated programs, SocSciBot Tools and Cyclist. It can be read away from a computer to gain a general impression of the capabilities of the software, or used directly. Note that as the software is updated and enhanced the tutorial will become out of date in various different ways, but its capabilities will remain fundamentally the same.

Step 1: Installing SocSciBot, SocSciBot Tools and Cyclist

The three programs can be downloaded and installed from the SocSciBot web site.

1. Go to the SocSciBot web site http://socscibot.wlv.ac.uk/ and follow the link to "download all three programs in one file". When prompted by your computer, choose a place to save the programs to where you have plenty of storage space to save data. This will typically be your computer's hard drive, e.g. the C: drive.


2. Next, unzip the file SocSciBotAll.zip from the place where you saved it. This will create several new files: the programs SocSciBot, SocSciBot Tools, and Cyclist.


Step 2: Installing Pajek

If you wish to produce network diagrams with SocSciBot data then you are recommended to install the network drawing software Pajek. Please do this before starting SocSciBot for the first time, because SocSciBot looks for Pajek when it is first started and will not find Pajek if Pajek is installed after SocSciBot is first run.

1. Go to the Pajek home page http://vlado.fmf.uni-lj.si/pub/networks/pajek/ and download and install the latest version of Pajek.

Step 3: Crawling a first site with SocSciBot

When SocSciBot is first started, it will ask some questions about where to store the data that it collects when it crawls web sites, and for a 'project' name to give each collection of one or more web sites crawled.

1. Start up SocSciBot by double clicking on the file called either SocSciBot or SocSciBot.exe where you unzipped it to on your computer. This should produce the following dialog box.


2. Confirm that the folder chosen by SocSciBot to store your data is acceptable by clicking OK, and answer any questions about the location of Microsoft Excel and Pajek.

3. Enter small test as the name of the project at the bottom of the next dialog box, Wizard Step 1, and then click on the start new project button. All crawls are grouped together into projects. This allows you to have different named groups of crawls and to analyze them separately.


4. Click No to answer the strange question (below) that you are asked next. This is an advanced data cleansing facility that you are unlikely to need before you become an expert user.

5. Click No to answer the second strange question (below) that you are asked next. This is another advanced facility that you are unlikely to need before you become an expert user.

6. In the Wizard Step 2 dialog box, enter http://linkanalysis.wlv.ac.uk/ as the starting URL of the site to crawl, and then click Start a new crawl of this site.


7. A new screen will now appear, giving a lot of information. None of this needs to be altered - it is mainly for advanced features of SocSciBot. The crawl is ready to go. Click the Crawl Site button on the new screen.


Information about the crawl can be read in the title bar at the top of the screen during the crawl and also at the end of the crawl. The title bar will report the URL of each page as it is crawled, as well as the total number of URLs in the list to be crawled and the number of URLs that have already been crawled. After half a minute or so the crawl should end and the screen below will be displayed. The numbers in the bar at the top of the screen may be different if the web site has changed.


8. Click Yes to shut down SocSciBot when the crawl is complete. You have now crawled all pages on the http://linkanalysis.wlv.ac.uk site. Before doing some simple link analyses, two more sites will be crawled in the next step.

Step 4: Crawling two more sites with SocSciBot

1. Start up SocSciBot again by double clicking on the SocSciBot or SocSciBot.exe file where you unzipped it on your computer. This should take you straight through to Wizard Step 1. Click on small test to add another crawl to this project.



2. Enter http://cybermetrics.wlv.ac.uk/ as the URL of the second site to crawl, and click Start a new crawl of this site.

3. Click on the Crawl Site button on the next screen and wait for the crawler to finish.
4. Click Yes to end the crawl.
5. Repeat steps 1 to 4 for the URL http://socscibot.wlv.ac.uk/

Three web sites have now been crawled and are ready to be analyzed by SocSciBot Tools.

Step 5: Viewing basic reports about the "small test" project with SocSciBot Tools

1. Start up SocSciBot Tools by double clicking on the SocSciBot Tools or SocSciBot Tools.exe file where you unzipped it on your computer. This should take you straight through to the Project Selection Wizard. Click on small test to select this project to analyze.


2. Select Use this project in the following dialog box.
3. Answer Yes to the question about whether you would like a set of basic reports.


4. After a few seconds, the reports will have been calculated and you can view them using the drop down menu in the middle of the screen (see above). Click on All external links (at the top of the list). More information will be displayed about it on the right of the screen (see below). Then click on View report to see a list of URLs targeting pages outside of each site (site outlinks). Try the same with all of the reports and try to work out what they contain. Notice that full URLs are not normally given: initial http:// and www are chopped off to save space. If you have Excel on your computer, you will sometimes get extra buttons that will allow you to view the reports in Excel. These reports should contain the link information needed for most link analysis investigations.


.ralph-abraham.org/vita/redwood/Vienna.html
biblio-fr.info.unicaen.fr/bnum/jelec/Solaris/d02/2bossy.html
Sherlock.berkeley.edu/asis96/asis96.html
.ascusc.org/jcmc/vol8/issue4/park.html
.cindoc.csic.es/cybermetrics/articles/v1i1p1.html
www2002.org/CDROM/refereed/338/
.robotstxt.org/wc/norobots.html
informationr.net/ir/9-3/paper180.html
.google.com/technology/

Above is a selection of the links in the All external links report. Note the removal of the initial http:// and http://www from these URLs (a simple space-saving measure).

5. A key report is ADM count summary, so click on this one and then click on the View in Excel button if you have it (otherwise click on the View report button). This shows the count of links to each site from all the other sites in the project, and the count of links from this site to the other sites in the project. These numbers are reported for each of four ADMs. Most people will only need the file ADM (i.e. standard link counting), which is the f-to column and the f-from column. For example, reading these two columns for the linkanalysis.wlv.ac.uk row, there are two links to linkanalysis.wlv.ac.uk from the other two sites, but five links from linkanalysis.wlv.ac.uk to the other two sites.


Step 6: Viewing a network diagram with Pajek

If you have installed Pajek on your system, you can view network diagrams of links between sites or links within sites, based upon data from SocSciBot Tools.

1. Use the drop-down box in the middle of the screen to select the option Pajek matrix for the whole project (with current options). If you are asked if you want to calculate the file, click Yes.


2. Click on the report single.combined.full to view it in Pajek. This contains link information combined from all the crawls into a single network, including links between sites, but excluding links within sites, and excluding all links to sites that have not been crawled. (Choosing other options can select different collections of links to include in the networks.)


3. Data for the network should now be loaded into Pajek (see above). To view the network, select Draw from the Draw menu in Pajek (see below).

4. If the network does not have labels (site domain names), select the Options menu, Mark Vertices Using, and Labels (or just Control-L). This should give a network of the inter-site links.


5. To get an improved layout of the network diagram, try selecting the Kamada-Kawai positioning algorithm by selecting Layout, Energy, Kamada-Kawai, Free and then viewing the result.


Step 7: Viewing site diagrams with Pajek

1. If you would like to see a diagram of each individual site, rather than the inter-site connections, this is also possible. If you select Pajek matrices for each individual site (with current options) from the drop-down menu, you will not quite get this, because the default for SocSciBot Tools is to ignore all internal site links. SocSciBot Tools needs to be informed that you want the internal site links and not any other type of link, which will give a diagram of the internal structure of a web site. Select Options and Subproject and ADM selection wizard from the File menu, and select just the site self-links option.


2. Now select Pajek matrices for each individual site (with current options) from the drop-down menu and view the files by clicking on them. You should get individual site networks in Pajek. Below are two of the networks, redrawn with the Kamada-Kawai algorithm. The second one has so many lines that it is hard to interpret, even if expanded to make the labels legible.


Note that for large networks, reduced diagrams can be obtained by choosing the directory or domain ADM in SocSciBot Tools instead of the file ADM.

Step 8: Using Cyclist

Cyclist is a text search engine, not a link analysis program. It will not be necessary for standard link analyses but it is provided in case it is useful.

1. Start up Cyclist by double clicking on the file called either Cyclist or Cyclist.exe where you unzipped it to on your computer.

2. Answer the questions and after 20 seconds or so of calculations you will get a standard search engine type interface. Try searching for a common word, like "link", and then clicking on the results in the right hand side to see what information you are given. In the example below, 41 pages in the project contain the word link (or links) and the first 10 are listed, with some extra information about them. Clicking on various parts of the screen will open each web page in Internet Explorer, or in Notepad, or will list the words in the page after the word searched for.


SUMMARY

This tutorial has illustrated the basic, most generally useful facilities available in SocSciBot, SocSciBot Tools and Cyclist. There are many advanced options that can be discovered through the online documentation and by experimentation.

Page 277: Link Analysis: An Information Science Approach (Library and Information Science) (Library and Information Science)

INDEX

Summary 265

Abraham, R.H., 2academic web, 54, 59, 62, 63, 69, 85, 86, 101, 137,

163Active Server Pages, 176Adams. J., 82ADMs. See Alternative Document Modelsadvanced features, search engines, 18, 95, 148Aguillo, I., 2, 137, 233Albert, R., 49, 51,150AllTheWeb, 74, 182,241Almind, T.C., 2Alonso Berrocal, J., 233AltaVista, 14, 52, 53, 54, 95, 97, 111, 148, 149,

151, 152, 153, 182,241Alternative Document Models, 27, 31,47, 63, 82,

88, 138, 141, 149, 150, 152, 163, 241Amazon.com, 16anchor text, 62Andersen, J., 147Arasu, A., 21,66Arnold, J., 147Arroyo, N., 20, 189ASP. See Microsoft's Active Server Pagesautomatically generated pages, 15, 193Ayres, M., 146Baeza-Yates, R., 13, 56, 121, 132, 133, 134Bailey, J.P., 36Baldi, P., 56Barabasi, A.L., 49, 51, 56, 150Bar-Ilan, 1, 43, 72, 73, 75, 76, 77, 78, 146, 184,

186Barnett, G., 9, 36Bates, M.J., 147Bazerman, C , 146Bennett, C , 146Berg, C.A., 146Bharat, K., 33, 120bias, 19, 96, 151bibliometrics, 3, 35, 146, 148Binns, R., 98, 222, 231Bird,J.E., 115Bjorneborn, L., 2, 6, 33, 53, 56, 61, 67, 214, 215,

216Borg,L, 221Borgatti, S.P.,216Borgman, C , 3, 6, 70, 71, 72, 78, 229Borner, K., 171,223Bossy, M.J., 2bot. See web crawlersBoyack, K., 171,223Braun, T., 93Brin, S., 14, 21, 23, 24, 25, 36, 62, 120, 122, 124Broder, A., 12, 13, 14, 32, 52, 53

Brody, T., 115Brookes, T., 23Burnett, R., 146,212Callahan, E., 110, 116Carr.L., 114, 115Casserly, M.F., 115Castillo, C , 13,56, 132Chakrabarti, S., 16, 21, 60, 61, 67Chang, B., 33Chappell, R., 159Chen, C , 89, 171,221,223Cho.J., 21,66Chu, H., 105Chubin, D., 78citation analysis, 3, 35, 69, 70, 72, 81, 93, 101, 102,

109, 111classification, 38, 39, 64, 65, 66, 76, 77, 94, 103,

151,155,156clique, 214clustering, 59, 61, 151clustering coefficients, 215co-linking, 6, 66,241commercial web sites, 173, 176, 177, 178competitive intelligence, 173, 174computer science perspective, 1, 47, 50, 59Conchar, M., 147confidence limits, 40content analysis, 38, 59, 61, 64, 72, 120content crawlers, 11, 12, 13, 14correlation testing, 41, 42, 82, 104, 149, 153, 158Cothey, V., 11,63, 113, 190crawl parameters, 10crawler. See web crawlersCronin, B., 6, 36, 78, 116Cui, L., 233cyclist, 196, 198

data cleansing, 199, 200, 201, 202Davenport, E., 36Davidson, G., 70, 71, 228de Beaver, D., 93de Moya-Anegon, F., 222de Vries, R., 114Deep web. See automatically generated pagesDeerwester, S., 120degree centrality, 214degrees of separation, 49departmental web sites, 75, 101, 102, 105Diamond, N., 72digital libraries, 114Dillon, A., 147directed graphs, 47, 96disciplinary differences, 63, 70, 103, 158

Page 278: Link Analysis: An Information Science Approach (Library and Information Science) (Library and Information Science)

266 Link Analysis: An Information Science Approach

disciplinary interlinking, 60, 64, 101, 103, 104,154, 170, 171

DMoz. See Open Directory ProjectDomingos, P., 132Dominick, J.R., 147, 148, 149Dougherty, M., 146, 209, 210, 212Dumais, S.T., 120duplicate pages, 11, 14, 25, 26Egghe, L., 6EICSTES, 99, 233Escher staircase, 214ethical issues, 17Everett, M.G., 216Faba-Perez, C , 222, 233Fairclough, R., 103, 104Faraj, S., 36FAST, 74feasibility study, 37Fernandez, M.T., 93Fidell, L., 42, 200Figueroa, C , 233filtering, 26

Flake, G., 51,56,62, 150Foot, K., 146, 209, 210, 212Ford, C , 109, 111,112, 116Frasconi, P., 56Fredrick, C , 146Freeman, L., 214, 216Friedman, M., 146Furnas, G.W., 120Furner, J., 3, 6, 70, 72, 78, 229Garcia-Molina, H., 21,66Garfield, E., 109,110Garrett, N.A., 147Garrido, M., 10, 36, 146, 216Gaspar-Guardado, M.A., 36, 233Geisler, E., 228Geissler, G., 147Gellmann, J.S., 116geography, 88, 89, 106Georghiou, L., 93Gibbons, M., 99, 145Giles, C.L., 51,52, 115, )73, 184Glanzel, W., 93Glover, E., 51,62Gomez, I., 93Goodeve, C , 43Goodrum, A.A., 115

Google, 1, 16,23,24,25, 175, 176, 179, 182, 191Gordon, A.D., 232Graham, H.D., 72Greenhalgh, C , 110Griffith, B.C., 101Groenen, P., 221Guerrero-Bote, V.P., 222, 233Gupta, A., 147

Gushrowski, B.A., 147Halavais, A., 10, 36, 146, 216Hammond, N., 146Harnad, S., 114, 115Harries, G., 33, 38, 39, 43, 63, 64, 73, 74, 76, 78,

81, 82, 83, 84, 85, 103, 104, 145, 147, 150, 151,156, 158, 159,231

Harter,S., 109, 111, 112, 116Hartz, J., 159Harvest-NG, 190Haug, G.,71Hayes, B., 56Hearit, K.M., 146Henzinger, M., 33Hernandez-Borges, A.A., 36, 233Herring, S.D., 113, 146Heydon, A., 14Hine, C , 146,209,211,212HITS, 121, 127, 128, 129, 130, 131, 132HNA. See hyperlink network analysisHowell, D.C., 37, 150Huberman, B.A., 56, 150Hyland, K., 71hyperlink, 5, 113hyperlink network analysis, 213Hysen, K., I l l , 112indexing, 174information science perspective, 1, 2, 3, 23, 24, 35,

70, 109, 238Ingwersen, P., 2, 6, 24, 25, 35, 70, 87, 95, 109, 147,

230, 233Mink, 5, 6, 51,81,86, 87, 101, 111,242Institute for Scientific Information, 110interlink, 5,242international links, 93, 94, 95, 96, 97, 99, 106, 137,

138, 139, 141, 142, 143, 144Internet Archive, 20, 31, 115, 181,184Internet Service Providers, 148, 149, 150, 151Invisible Web. See automatically generated pagesISP. See Internet Service ProvidersJankowski, N., 212Java, 15, 120, 121, 176, 193JavaScript, 14, 120Jeong, H., 49Joshi, M., 16, 60journal citations, 3Journal Impact Factor, 109, 112journal web sites, 110, 112Kamada, T., 221Kaski, S., 222, 223Katz, J.S., 89Kawai, S., 221k-core, 214, 215

KDViz. See knowledge domain visualisationKim, H.J., 70, 78, 113Kirstein, J., 71

Page 279: Link Analysis: An Information Science Approach (Library and Information Science) (Library and Information Science)

Summary 267

Kleinberg, J., 25, 36, 127, 128, 129
Kling, R., 104, 110, 116, 146
knowledge domain visualization, 223
Knudsen, I., 71
Kohonen, T., 222, 223
Kolmogorov-Smirnov test, 153
Koster, M., 17
Kot, M., 146
Krackhardt, D., 215
Krippendorff, K., 151
Kruskal, J.B., 232
Kumar, R., 12, 13, 14, 32, 52, 53
La Barre, K., 78
Lagus, K., 222, 223
Landes, W.M., 145, 147
Larson, R., 2
Lawrence, S., 51, 52, 56, 62, 115, 150, 173, 184
Lederbogen, U., 146
Levene, M., 56
Leydesdorff, L., 101
Li, X., 6, 103, 105, 153, 159, 230, 233
Lifantsev, M., 122
Limoges, C., 99
Lin, W.Y., 151
linguistic influences, 96, 97, 98
link analysis, 1, 2, 3, 23, 35, 39, 70, 85, 104, 109, 115, 145
link counting, 23, 28, 31, 35, 51
link databases, 203
link extraction, 192
link indicators, 227, 228, 230, 232
link topologies, 52, 53, 54
log file analysis, 114
Logan, E., 115
longitudinal analysis, 2, 207, 210
Lotka, A., 49
Lotka's Law. See power law theory
Lu, S., 147
Luce, R., 214
Lundgren, T.D., 147
Luukkonen, T., 93
Macias-Cervi, P., 36, 233
Macromedia Dreamweaver, 25
Macromedia Flash, 15, 20, 176, 178, 193
Maghoul, F., 12, 13, 14, 32, 52, 53
Marshall, P., 146, 212
Martinson, A., 116
Matthew effect, 50, 61, 95
maximal spanning tree, 89
McCain, K.W., 115, 222
McCaughey, M., 146
McConnell, M., 70, 71, 228
McGill, J., 121
McGonagle, J.J., 175, 180
McKim, G., 104, 146
McMillan, S., 151
MDS. See multidimensional scaling
Menczer, F., 61
Mendelzon, A.O., 120
Mendez, A., 93
Merton, R., 70, 75, 77, 110
Mettrop, W., 21, 186
Microsoft PowerPoint, 28
Microsoft Site Analyst, 10
Microsoft Word, 17, 28, 191, 193, 205
Microsoft's Active Server Pages, 16
Middleton, I., 70, 71, 228
Mihaila, G.A., 120
Milgram, S., 49
Miller, H., 147
Miller-Whitehead, M., 116
mirror sites, 31
Moed, H.F., 70, 72, 97, 98
Moitra, S., 78
Motwani, R., 122, 124
multidimensional scaling, 221, 222
Musgrove, P., 98, 222
Najork, M., 14
Nam, I., 9, 36
Nantz, K.S., 147
Negroponte, N., 88
Nentwich, M., 64, 107, 212
network diagrams, 163, 164, 165, 166, 167, 168, 169, 170, 171, 219, 220, 221
Neuendorf, K., 27, 38, 40, 41, 44
Newman, J., 221
Newman, R., 221
Ng, A.Y., 132
Nielsen, J., 127
Nieuwenhuysen, P., 21, 186
Nordstrom, R.D., 174, 180
Nowotny, H., 99
obscured links, 14, 193
Open Directory Project, 19, 60, 61, 67
Oppenheim, C., 3, 7, 43, 78, 81, 110, 148, 149, 151
Otte, E., 216
outlink, 5, 6, 25, 87, 242
Paepcke, A., 21
Page, L., 14, 21, 23, 24, 25, 36, 62, 120, 122, 124
Page-Kennedy, T., 98, 222
PageRank, 85, 121, 122, 123, 124, 125, 126, 127, 131, 132
Pajek, 163, 164, 167, 215, 221, 223
Palmer, J.W., 36
Papacharissi, Z., 147, 148, 149
Park, H.W., 6, 9, 36, 96, 213, 216
patent citations, 3
PDF. See Portable Document Format
Pearson correlation, 94, 153, 154
Pennock, D., 16, 51, 56, 60, 62, 150
personal web pages, 145, 146, 148, 154, 158
Persson, O., 93
PHP, 176. See Hypertext Pre-processor
pilot study, 37
Pinkerton, R.L., 174, 180
Portable Document Format, 17, 28, 114, 193
Posner, R.A., 146, 147
Poulovassilis, A., 56
Powell, T., 192, 193
power law theory, 48, 49, 51, 52, 54, 55, 242
Price, E., 38, 39, 43, 64, 65, 73, 74, 76, 78, 97, 98, 103, 104, 132, 147, 151, 156, 159
Price, M., 15
Pruijt, H., 147
Punera, K., 16, 60
Pyle, D., 199, 201
QAP. See Quadratic Assignment Procedure
Quadratic Assignment Procedure, 215
qualitative analysis, 1, 37, 148, 150, 216
quantitative analysis, 2, 3, 31, 37, 42, 43, 98, 148, 180, 210, 227
Rada, R., 221
RAE. See Research Assessment Exercise
Rafiei, D., 120
Raghavan, S., 21
random sampling, 38
range counting, 30
ranking, 19, 174
reliability, 4, 36, 37, 111, 228, 229
Renn, S., 78
replicated link, 25, 26, 76
Research Assessment Exercise, 63, 81, 82, 89, 105, 153
Ribeiro-Neto, B., 121, 133, 134
Richardson, M., 132
robot. See web crawlers
robots.txt, 17
Rodriguez i Gairin, J.M., 2, 109
Rosen, R., 93
Rosenbaum, H., 111, 116, 186
Rousseau, R., 2, 183, 186, 214, 216
Rowland, F., 110
Rudner, L.M., 116
Ruhl, M., 33
Saint-Jean, P., 56, 132
Salton, G., 121
Scharnhorst, A., 228
Schneider, S., 146, 209, 210, 212
scholarly communication, 2, 77, 115
Schubert, A., 93
scientometrics, 3, 101
search engines, 12, 15, 16, 18, 26, 31, 38, 61, 94, 114, 119, 121, 173, 174, 175, 181, 183
Sebastian, J., 93
Seidman, S., 214
self link, 5, 24, 25, 32, 157, 243
self-organizing maps, 222
Shannon, C., 148
Shaw, D., 78, 115
Sheldon, M., 2
Sherman, G., 15
Shockwave, 15, 193
Silverman, E., 146
Sivertsen, G., 93
Sloan, B., 146
small world phenomena, 49
Small, H., 101
Smith, A.G., 24, 36, 75, 83, 95, 96, 109, 111, 116, 182, 186, 219, 232
Smyth, P., 56
SNA. See social network analysis
Snyder, H., 111, 116, 186
social network analysis, 213, 215, 216
social science perspective, 1, 209
SocSciBot, 10, 14, 26, 31, 54, 82, 137, 138, 149, 163, 173, 174, 176, 189, 190, 191, 193, 203
SocSciBot Tools, 163, 164, 189, 195, 196, 198, 202, 204, 206, 207, 215
spam, 20, 26, 121, 132
Spearman's rho, 82, 94, 153, 154
spider. See web crawlers
spider traps, 15, 16, 26, 31
Strogatz, S.H., 49, 215
Stubbs, P., 146
Sullivan, D., 121
Sweeney, A.E., 110
Tabachnick, B., 42, 200
Takkadorie, A., 6
Tang, R., 83, 97, 98, 104, 105, 106, 116, 146, 154, 158, 215
Tashakkori, A., 147, 148
Teddlie, C., 7, 147, 148
Teoma, 127
Thelwall, M., 2, 6, 10, 13, 14, 17, 19, 20, 24, 25, 30, 31, 32, 36, 54, 55, 63, 64, 65, 66, 75, 76, 81, 82, 83, 84, 85, 87, 88, 89, 90, 94, 95, 96, 97, 98, 103, 104, 105, 106, 110, 111, 112, 113, 116, 120, 131, 132, 137, 142, 145, 146, 147, 149, 150, 151, 152, 153, 154, 156, 157, 158, 159, 173, 174, 184, 186, 196, 201, 206, 213, 214, 215, 216, 219, 220, 222, 230, 231, 232
theoretical physics perspective, 1, 50
Thomas, O., 105, 106
Tijssen, R.J.W., 97
TLD spectral analysis, 31, 32, 200, 201
Trebbe, J., 146
Treise, D., 146
truncating, 11
Tsioutsiouliklis, K., 62
UCINET, 215
Underwood, J., 175
university interlinking, 62, 74, 76, 77, 81, 84, 89, 94, 96, 98, 103, 106, 144, 154, 158, 168, 169
university web sites, 62, 69, 70, 71, 72, 75, 82, 145, 148, 164, 203
URL crawlers, 11
validity, 4, 36
van Leeuwen, T., 97
van Raan, A.F.J., 6, 70, 228
van Selm, M., 212
Vaughan, L., 2, 6, 19, 20, 42, 44, 63, 83, 88, 89, 97, 98, 110, 111, 112, 113, 115, 132, 149, 180, 184
vector space model, 120, 121, 133, 134
Velez, B., 2
Vella, C.M., 175, 180
Vibert, C., 174
Vlado, A., 163
Vreeland, R.C., 35, 116
Walsh-Childers, K., 146
wanderer. See web crawlers
Watts, D.J., 49, 215
WCF. See Web Connectivity Factor
Weare, C., 151
Weaver, W., 148
Web Connectivity Factor, 232
web crawlers, 9, 10, 11, 12, 14, 15, 16, 17, 53, 111, 189
web document, 27, 28
web growth, 50, 61
Web Impact Factors, 87, 231
web page design, 119, 131, 173
web pages, 17, 18, 28
web site, 6, 27
Web Use Factors, 87, 231
WebKing, 10
webometrics, 2, 6
Weigold, M.F., 145, 146
Weiss, R., 2
White, H.D., 101, 222
WIF. See Web Impact Factors
Wilkinson, D., 6, 13, 30, 32, 38, 39, 43, 54, 55, 63, 64, 66, 73, 74, 76, 78, 103, 104, 147, 150, 151, 153, 156, 159
Willett, P., 105, 106
WinHTTrack, 152
Winograd, T., 122, 124
WISER, 99, 228, 233
Wish, M., 232
Wormell, I., 180
Wouters, P., 114
WUF. See Web Use Factors
Yahoo!, 19, 24, 49, 61, 66, 132, 185, 243
Zazo, A., 233
Zhao, D., 115
Zinkhan, C.M., 147