Link Analysis: An Information Science Approach
Recent and Forthcoming Volumes
Leo Egghe
    Power Laws in the Information Production Process: Lotkaian Informetrics
Donald Case
    Looking for Information
Matthew Locke Saxton and John V. Richardson
    Understanding Reference Transactions: Turning Art Into a Science
Robert M. Hayes
    Models for Library Management, Decision-Making, and Planning
Charles T. Meadow, Bert R. Boyce, and Donald H. Kraft
    Text Information Retrieval Systems, Second Edition
Charles T. Meadow
    Text Information Retrieval Systems
A.J. Meadows
    Communicating Research
V. Frants, J. Shapiro, & V. Votskunskii
    Automated Information Retrieval: Theory and Methods
Harold Sackman
    Biomedical Information Technology: Global Social Responsibilities for the Democratic Age
Peter Clayton
    Implementation of Organizational Innovation: Studies of Academic and Research Libraries
Bryce L. Allen
    Information Tasks: Toward a User-Centered Approach to Information Systems
Library and Information Science
Series Editor: Bert R. Boyce
School of Library & Information Science
Louisiana State University, Baton Rouge
Amsterdam - Boston - Heidelberg - London - New York - Oxford - Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo
ELSEVIER B.V., Radarweg 29, P.O. Box 211, 1000 AE Amsterdam, The Netherlands
ELSEVIER Inc., 525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
ELSEVIER Ltd., The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
ELSEVIER Ltd., 84 Theobalds Road, London WC1X 8RR, UK
© 2004 Elsevier Inc. All rights reserved.
This work is protected under copyright by Elsevier Inc., and the following terms and conditions apply to its use:
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use.
Permissions may be sought directly from Elsevier's Rights Department in Oxford, UK: phone (+44) 1865 843830, fax (+44) 1865 853333, email: firstname.lastname@example.org. Requests may also be completed on-line via the Elsevier homepage (http://www.elsevier.com/locate/permissions).
In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 20 7631 5555; fax: (+44) 20 7631 5500. Other countries may have a local reprographic rights agency for payments.
Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of the Publisher is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.
Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter.
Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher.
Address permissions requests to: Elsevier's Rights Department, at the fax and e-mail addresses noted above.
Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 2004
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
Printed in The Netherlands.
Part I: Theory

1 Introduction
    Objectives
    Link analysis
    Historical overview
    What is the information science approach to link analysis?
    Contents and structure
    Key terminology
    Summary
    Further reading
    References

2 Web crawlers and search engines
    Objectives
    Introduction
    Web crawlers
        Finding pages
        Content crawling vs. URL crawling
        Content crawling
        Obscured links
        Depth and other arbitrary limitations
        Automatically generated pages
        Ethical issues and robots.txt
        The web page
        Web crawling summary
    Search engines
        Known biases
        Search engine ranking
    The Internet Archive
    Summary
    Further reading
    References

3 The theoretical perspective for link counting
    Objectives
    Introduction
    The theoretical perspective for link counting
    Anomalies
    Manual filtering and banned lists
    Alternative Document Models
        Web sites and web documents
        ADMs and standard ADM counting
        ADM range counting models
    Choosing link counting strategies
    Summary
    Further reading
    References

4 Interpreting link counts: Random samples and correlations
    Objectives
    Introduction
    Interpreting link counts
    The pilot feasibility and validity study
    Full-scale random sampling
    Confidence limits for categories
    Correlation testing
    Literature review
    Summary
    Further reading
    References

Part II: Web structure

5 Link structures in the web graph
    Objectives
    Introduction
    Power laws in the web
    Models of web growth
    Link topologies
    Power laws and link topologies in academic webs
    Summary
    Further reading
    References

6 The content structure of the web
    Objectives
    Introduction
    The topic structure of the web
    A link-content web growth model
    Link text
    The subject structure of academic webs
    Colinks
    Summary
    Further reading
    References

Part III: Academic links

7 Universities: Link types
    Objectives
    Introduction
    Citation analysis
    The role of a university web site
    National systems of university web sites
    Page types
    Link types
    Summary
    Further reading
    References

8 Universities: Link models
    Objectives
    Introduction
    The relationship between inlinks and research
    Academic linking: Quality vs. quantity
    Alternative logical linking models
    Mathematical models
    The influence of geography
    Regional groupings
    Summary
    References

9 Universities: International links
    Objectives
    Introduction
    National vs. international links
    International linking comparisons
    Linguistic influences
    Summary
    Further reading
    References

10 Departments and subjects
    Objectives
    Introduction
    Departmental web sites
    Disciplinary differences in link types
    Issues of scale and correlation tests
        Country
        Subject
        Outcome
    Geographic and international factors
    Summary
    Further reading
    References

11 Journals and articles
    Objectives
    Introduction
    Journal Impact Factors
    Journal web sites
    Journal web site inlinks: Issues
    Journal web site inlinks: Case study
    Types of links in journal articles
    Digital library links
    Combined link and log file analysis
    Related research topics
    Summary
    Further reading
    References

Part IV: Applications

12 Search engines and web design
    Objectives
    Introduction
    Link structures and crawler coverage
    Text in web sites and the Vector Space Model
    The PageRank algorithm
    Case study: PageRank calculations for a gateway site
    HITS
    HITS worked example
    Summary: Web site design for PageRank and HITS
    Further reading
    Appendix: the Vector Space Model
    References

13 A health check for Spanish universities
    Objective
    Introduction
    Research questions
    Methods
    Results and discussion
    Conclusion
    References

14 Personal web pages linking to universities
    Objectives
    Introduction
    Web publishing and personal home pages
    Research questions
    Methods
        Data collection
        Data analysis
    Results
        ISP bias test
        ADM fitting
        Correlations between links and research ratings
        A comparison of university and home page link sources
        Individual page categorizations
    Conclusion
    Meta-conclusions
    Acknowledgement
    References

15 Academic networks
    Objectives
    Introduction
    Methods
    University sitemaps
    National academic web maps
    Subject maps
    Summary
    Further reading
    References

16 Business web sites
    Objectives
    Introduction
    Site coverage checks
    Site indexing and ranking checks
    Competitive intelligence
    Case study
        Center Parcs
        Hoseasons
        Butlins
        Pontins
        Haven Holidays
        General queries
    Summary
    Further reading
    References

Part V: Tools and techniques

17 Using commercial search engines and the Internet Archive
    Objectives
    Introduction
    Checking results
    Dealing with variations in results
    Using multiple search engines
    Using the Internet Archive
    Summary
    Online resources
    Further reading
    References

18 Personal crawlers
    Objectives
    Introduction
    Types of personal crawler
    SocSciBot
        Web page retrieved
        Web page qualification
        Web link extraction
        URLs from HTTP
        Obscured or unspecified URLs
        Server-generated pages
        Dealing with errors
        Human intervention during crawls
    SocSciBot tools
    Summary
    Online resources
    Further reading
    References

19 Data cleansing
    Objectives
    Introduction
    Overview of data cleansing techniques
    Anomaly identification
    TLD Spectral Analysis
    Summary
    Online resources
    References

20 Online university link databases
    Objective
    Introduction
    Overview of the link databases
    Link structure files
    The banned lists
    Analyzing the data
    Other link structure databases
    Summary
    Online resources
    Further reading
    Reference

21 Embedded link analysis methodologies
    Objectives
    Introduction
    Web Sphere Analysis
    Virtual ethnography
    Summary
    Further reading
    References

22 Social Network Analysis
    Objectives
    Introduction
    Some SNA metrics
    Software
    Summary
    Further reading
    References

23 Network visualizations
    Objectives
    Introduction
    Network diagrams
    Large network diagrams
    MultiDimensional Scaling
    Self-Organizing Maps
    Knowledge Domain Visualisation
    Summary
    Online resources
    References

24 Academic link indicators
    Objective
    Introduction
    Web indicators as process indicators
    Issues of size and reliability
    Benchmarking indicators
    Link metrics
    Relational indicators
    Other metrics
    Summary
    Further reading
    References

Part VI: Summary

25 Summary
    Objectives
    Introduction
    Information science contributions to link analysis
    Other link analysis approaches
    Future directions

26 Glossary
    References

Appendix
A SocSciBot tutorial
    Tutorial
        Step 1: Installing SocSciBot, SocSciBot Tools and Cyclist
        Step 2: Installing Pajek
        Step 3: Crawling a first site with SocSciBot
        Step 4: Crawling two more sites with SocSciBot
        Step 5: Viewing basic reports about the "small test" project with SocSciBot Tools
        Step 6: Viewing a network diagram with Pajek
        Step 7: Viewing site diagrams with Pajek
        Step 8: Using Cyclist
    Summary

Index
PART I: THEORY
To introduce the content and structure of the book and some key terminology.
To outline the information science approach to link analysis.
Link analysis is performed in very diverse subjects, from computer science and theoretical physics to information science, communication studies and sociology. This is a testament both to the importance of the web and to a widespread belief that hyperlinks between web pages can yield useful information of one kind or another. This belief probably stems from several related factors: the success of Google, which uses a link-based algorithm for identifying the best pages; analogies with other phenomena, such as journal citations and social connections; and probably also links being 'in your face' all the time, whether using the web for research, business or recreation.
In this book, an information science approach to link analysis is set out with the principal aim of introducing it to a new audience. This new audience will then be able to critically evaluate existing research and develop their own research projects and methods. It is a central belief of this book that the information science approach is widely useful to other researchers, particularly social scientists interested in analyzing phenomena with an online component. No attempt is made to give comprehensive coverage of all different types of link analysis: such an enterprise would fail between the detail of the mathematics used in some areas and the qualitative approach used in others. The information science theme of the book has resulted in at least half of its content being related to the study of academic web use or scholarly communication. Readers may therefore also gain insights into scholarly communication.
The book seeks to answer four main questions.
Which kinds of information can be extracted by analyzing the hyperlinks between a set of web pages or sites?
Which techniques should be used?
What are the likely pitfalls of link analysis?
How can and should a link analysis be conducted in practice?
The start of published web link analysis research appears to date from 1995-1996, occurring simultaneously in several disciplines, including computer science for search engine development (e.g., Weiss, Velez, Sheldon et al., 1996), and mathematics for structure and complexity analysis (e.g., Abraham, 1996). The first information scientist to publish a discussion of the potential for transferring information science techniques to the Internet appears to be the Brazilian Marcia J. Bossy (1995), with an article in a French online journal. The first published information science link analysis seems to be that of Larson (1996). His "Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace" presentation at the American Society for Information Science conference explicitly ad...