Link Analysis: An Information Science Approach (Library and Information Science)



  • Link Analysis: An Information Science Approach

  • Recent and Forthcoming Volumes

    Leo Egghe, Power Laws in the Information Production Process: Lotkaian Informetrics

    Donald Case, Looking for Information

    Matthew Locke Saxton and John V. Richardson, Understanding Reference Transactions: Turning Art Into a Science

    Robert M. Hayes, Models for Library Management, Decision-Making, and Planning

    Charles T. Meadow, Bert R. Boyce, and Donald H. Kraft, Text Information Retrieval Systems, Second Edition

    Charles T. Meadow, Text Information Retrieval Systems

    A.J. Meadows, Communicating Research

    V. Frants, J. Shapiro, and V. Voiskunskii, Automated Information Retrieval: Theory and Methods

    Harold Sackman, Biomedical Information Technology: Global Social Responsibilities for the Democratic Age

    Peter Clayton, Implementation of Organizational Innovation: Studies of Academic and Research Libraries

    Bryce L. Allen, Information Tasks: Toward a User-Centered Approach to Information Systems

    Library and Information Science

    Series Editor: Bert R. Boyce, School of Library & Information Science, Louisiana State University, Baton Rouge

  • Mike Thelwall




    Amsterdam - Boston - Heidelberg - London - New York - Oxford - Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo

    Link Analysis: An Information Science Approach

  • ELSEVIER B.V., Radarweg 29, P.O. Box 211, 1000 AE Amsterdam, The Netherlands
    ELSEVIER Inc., 525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
    ELSEVIER Ltd., The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
    ELSEVIER Ltd., 84 Theobalds Road, London WC1X 8RR, UK

    © 2004 Elsevier Inc. All rights reserved.

    This work is protected under copyright by Elsevier Inc., and the following terms and conditions apply to its use:


    Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use.

    Permissions may be sought directly from Elsevier's Rights Department in Oxford, UK: phone (+44) 1865 843830, fax (+44) 1865 853333, email: Requests may also be completed on-line via the Elsevier homepage ( permissions).

    In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 20 7631 5555; fax: (+44) 20 7631 5500. Other countries may have a local reprographic rights agency for payments.

    Derivative Works
    Tables of contents may be reproduced for internal circulation, but permission of the Publisher is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.

    Electronic Storage or Usage
    Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter.

    Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher.
    Address permissions requests to: Elsevier's Rights Department, at the fax and e-mail addresses noted above.

    Notice
    No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

    First edition 2004

    ISBN: 0-12-088553-0

    The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
    Printed in The Netherlands.

  • Contents

    Part I: Theory

    1 Introduction
        Objectives
        Link analysis
        Historical overview
        What is the information science approach to link analysis?
        Contents and structure
        Key terminology
        Summary
        Further reading
        References

    2 Web crawlers and search engines
        Objectives
        Introduction
        Web crawlers
            Finding pages
            Content crawling vs. URL crawling
            Content crawling
            Obscured links
            Depth and other arbitrary limitations
            Automatically generated pages
            Ethical issues and robots.txt
            The web page
            Web crawling summary
        Search engines
            Known biases
            Search engine ranking
        The Internet Archive
        Summary
        Further reading
        References

    3 The theoretical perspective for link counting
        Objectives
        Introduction
        The theoretical perspective for link counting
        Anomalies
        Manual filtering and banned lists
        Alternative Document Models
            Web sites and web documents
            ADMs and standard ADM counting
            ADM range counting models
        Choosing link counting strategies
        Summary
        Further reading
        References

    4 Interpreting link counts: Random samples and correlations
        Objectives
        Introduction
        Interpreting link counts
        The pilot feasibility and validity study
        Full-scale random sampling
        Confidence limits for categories
        Correlation testing
        Literature review
        Summary
        Further reading
        References

    Part II: Web structure

    5 Link structures in the web graph
        Objectives
        Introduction
        Power laws in the web
        Models of web growth
        Link topologies
        Power laws and link topologies in academic webs
        Summary
        Further reading
        References

    6 The content structure of the web
        Objectives
        Introduction
        The topic structure of the web
        A link-content web growth model
        Link text
        The subject structure of academic webs
        Colinks
        Summary
        Further reading
        References

    Part III: Academic links

    7 Universities: Link types
        Objectives
        Introduction
        Citation analysis
        The role of a university web site
        National systems of university web sites
        Page types
        Link types
        Summary
        Further reading
        References

    8 Universities: Link models
        Objectives
        Introduction
        The relationship between inlinks and research
        Academic linking: Quality vs. quantity
        Alternative logical linking models
        Mathematical models
        The influence of geography
        Regional groupings
        Summary
        References

    9 Universities: International links
        Objectives
        Introduction
        National vs. international links
        International linking comparisons
        Linguistic influences
        Summary
        Further reading
        References

    10 Departments and subjects
        Objectives
        Introduction
        Departmental web sites
        Disciplinary differences in link types
        Issues of scale and correlation tests
            Country
            Subject
            Outcome
        Geographic and international factors
        Summary
        Further reading
        References

    11 Journals and articles
        Objectives
        Introduction
        Journal Impact Factors
        Journal web sites
        Journal web site inlinks: Issues
        Journal web site inlinks: Case study
        Types of links in journal articles
        Digital library links
        Combined link and log file analysis
        Related research topics
        Summary
        Further reading
        References

    Part IV: Applications

    12 Search engines and web design
        Objectives
        Introduction
        Link structures and crawler coverage
        Text in web sites and the Vector Space Model
        The PageRank algorithm
        Case study: PageRank calculations for a gateway site
        HITS
        HITS worked example
        Summary: Web site design for PageRank and HITS
        Further reading
        Appendix: the Vector Space Model
        References

    13 A health check for Spanish universities
        Objective
        Introduction
        Research questions
        Methods
        Results and discussion
        Conclusion
        References

    14 Personal web pages linking to universities
        Objectives
        Introduction
        Web publishing and personal home pages
        Research questions
        Methods
            Data collection
            Data analysis
        Results
            ISP bias test
            ADM fitting
            Correlations between links and research ratings
            A comparison of university and home page link sources
            Individual page categorizations
        Conclusion
        Meta-conclusions
        Acknowledgement
        References

    15 Academic networks
        Objectives
        Introduction
        Methods
        University sitemaps
        National academic web maps
        Subject maps
        Summary
        Further reading
        References

    16 Business web sites
        Objectives
        Introduction
        Site coverage checks
        Site indexing and ranking checks
        Competitive intelligence
        Case study
            Center Parcs
            Hoseasons
            Butlins
            Pontins
            Haven Holidays
            General queries
        Summary
        Further reading
        References

    Part V: Tools and techniques

    17 Using commercial search engines and the Internet Archive
        Objectives
        Introduction
        Checking results
        Dealing with variations in results
        Using multiple search engines
        Using the Internet Archive
        Summary
        Online resources
        Further reading
        References

    18 Personal crawlers
        Objectives
        Introduction
        Types of personal crawler
        SocSciBot
            Web page retrieved
            Web page qualification
            Web link extraction
            URLs from HTTP
            Obscured or unspecified URLs
            Server-generated pages
            Dealing with errors
            Human intervention during crawls
        SocSciBot tools
        Summary
        Online resources
        Further reading
        References

    19 Data cleansing
        Objectives
        Introduction
        Overview of data cleansing techniques
        Anomaly identification
        TLD Spectral Analysis
        Summary
        Online resources
        References

    20 Online university link databases
        Objective
        Introduction
        Overview of the link databases
        Link structure files
        The banned lists
        Analyzing the data
        Other link structure databases
        Summary
        Online resources
        Further reading
        Reference

    21 Embedded link analysis methodologies
        Objectives
        Introduction
        Web Sphere Analysis
        Virtual ethnography
        Summary
        Further reading
        References

    22 Social Network Analysis
        Objectives
        Introduction
        Some SNA metrics
        Software
        Summary
        Further reading
        References

    23 Network visualizations
        Objectives
        Introduction
        Network diagrams
        Large network diagrams
        MultiDimensional Scaling
        Self-Organizing Maps
        Knowledge Domain Visualisation
        Summary
        Online resources
        References

    24 Academic link indicators
        Objective
        Introduction
        Web indicators as process indicators
        Issues of size and reliability
        Benchmarking indicators
        Link metrics
        Relational indicators
        Other metrics
        Summary
        Further reading
        References

    Part VI: Summary

    25 Summary
        Objectives
        Introduction
        Information science contributions to link analysis
        Other link analysis approaches
        Future directions

    26 Glossary
        References

    Appendix: A SocSciBot tutorial
        Tutorial
            Step 1: Installing SocSciBot, SocSciBot Tools and Cyclist
            Step 2: Installing Pajek
            Step 3: Crawling a first site with SocSciBot
            Step 4: Crawling two more sites with SocSciBot
            Step 5: Viewing basic reports about the "small test" project with SocSciBot Tools
            Step 6: Viewing a network diagram with Pajek
            Step 7: Viewing site diagrams with Pajek
            Step 8: Using Cyclist
        Summary

    Index

  • Introduction

    Objectives

    To introduce the content and structure of the book and some key terminology.
    To outline the information science approach to link analysis.


    Link analysis

    Link analysis is performed in very diverse subjects, from computer science and theoretical physics to information science, communication studies and sociology. This is a testament both to the importance of the web and to a widespread belief that hyperlinks between web pages can yield useful information of one kind or another. This belief probably stems from several related factors: the success of Google, which uses a link-based algorithm for identifying the best pages; analogies with other phenomena, such as journal citations and social connections; and probably also links being 'in your face' all the time, whether using the web for research, business or recreation.

    In this book, an information science approach to link analysis is set out with the principal aim of introducing it to a new audience. This new audience will then be able to critically evaluate existing research and develop their own research projects and methods. It is a central belief of this book that the information science approach is widely useful to other researchers, particularly social scientists interested in analyzing phenomena with an online component. No attempt is made to give comprehensive coverage of all different types of link analysis: such an enterprise would fail between the detail of the mathematics used in some areas and the qualitative approach used in others. The information science theme of the book has resulted in at least half of its content being related to the study of academic web use or scholarly communication. Readers may therefore also gain additional insights into scholarly communication.

    The book seeks to answer four main questions.

    Which kinds of information can be extracted by analyzing the hyperlinks between a set of web pages or sites?
    Which techniques should be used?
    What are the likely pitfalls of link analysis?
    How can and should a link analysis be conducted in practice?


    Historical overview

    The start of published web link analysis research appears to date from 1995-1996, occurring simultaneously in several disciplines, including computer science for search engine development (e.g., Weiss, Velez, Sheldon et al., 1996), and mathematics for structure and complexity analysis (e.g., Abraham, 1996). The first information scientist to publish a discussion of the potential for transferring information science techniques to the Internet appears to be the Brazilian Marcia J. Bossy (1995), with an article in a French online journal. The first published information science link analysis seems to be that of Larson (1996). His "Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace" presentation at the American Society for Information Science conference explicitly ad...

