
  • Standard Web Search Engine Architecture
    [Diagram] Crawl the web; check for duplicates and store the documents; create an inverted index; search engine servers take the user query, look up DocIds in the inverted index, and show results to the user

  • More detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.

  • Indexes for Web Search Engines
    Inverted indexes are still used, even though the web is so huge
    Most current web search systems partition the indexes across different machines
    Each machine handles different parts of the data (Google uses thousands of PC-class processors and keeps most things in main memory)
    Other systems duplicate the data across many machines
    Queries are distributed among the machines
    Most do a combination of these

  • Search Engine Querying
    In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.

    Each row can handle 120 queries per second

    Each column can handle 7M pages

    To handle more queries, add another row.

    From a description of the FAST search engine, by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
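    A rough way to read those capacity figures, as a minimal Python sketch (the 120 queries/second and 7M-page numbers come from the slide; the 4x10 cluster size is an illustrative assumption):

        # Back-of-the-envelope sizing with the figures quoted above:
        # query throughput scales with rows, index size with columns.
        ROW_QPS, COL_PAGES = 120, 7_000_000

        def cluster_capacity(rows, columns):
            return rows * ROW_QPS, columns * COL_PAGES   # (queries/sec, pages indexed)

        print(cluster_capacity(rows=4, columns=10))      # (480, 70000000)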

  • Querying: Cascading Allocation of CPUs
    A variation on this that produces a cost savings:
    Put high-quality/common pages on many machines
    Put lower-quality/less common pages on fewer machines
    Query goes to the high-quality machines first
    If no hits are found there, go to the other machines

  • Google
    Google maintains (probably) the world's largest Linux cluster (over 15,000 servers)
    These are partitioned between index servers and page servers
    Index servers resolve the queries (massively parallel processing)
    Page servers deliver the results of the queries
    Over 8 billion web pages are indexed and served by Google

  • Search Engine Indexes
    Starting points for users include:
    Manually compiled lists: directories
    Page popularity: frequently visited pages (in general), frequently visited pages as a result of a query
    Link co-citation: which sites are linked to by other sites?

  • Starting Points: What is Really Being Used?
    Today's search engines combine these methods in various ways
    Integration of directories: today most web search engines integrate categories into the results listings (Lycos, MSN, Google)
    Link analysis: Google uses it; others are also using it; words on the links seem to be especially useful
    Page popularity: many use DirectHit's popularity rankings

  • Web Page Ranking
    Varies by search engine
    Pretty messy in many cases
    Details usually proprietary and fluctuating
    Combining subsets of:
    Term frequencies
    Term proximities
    Term position (title, top of page, etc.)
    Term characteristics (boldface, capitalized, etc.)
    Link analysis information
    Category information
    Popularity information

  • Ranking: Hearst 96
    Proximity search can help get high-precision results when the query has more than one term
    Combine Boolean and passage-level proximity
    Proves significant improvements when retrieving the top 5, 10, 20, 30 documents
    Results reproduced by Mitra et al. 98
    Google uses something similar

  • Ranking: Link Analysis
    Assumptions:
    If the pages pointing to this page are good, then this is also a good page
    The words on the links pointing to this page are useful indicators of what this page is about
    References: Page et al. 98, Kleinberg 98

  • Ranking: Link Analysis
    Why does this work?
    The official Toyota site will be linked to by lots of other official (or high-quality) sites
    The best Toyota fan-club site probably also has many links pointing to it
    Less high-quality sites do not have as many high-quality sites linking to them

  • Ranking: PageRank
    Google uses PageRank
    We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; d is usually set to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows:
    PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
    Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one
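    A minimal Python sketch of iterating the formula above to a fixed point on a toy link graph (the graph and iteration count are illustrative assumptions, not Google's implementation):

        def pagerank(links, d=0.85, iterations=50):
            """links: dict mapping each page to the list of pages it links to."""
            pages = set(links) | {t for targets in links.values() for t in targets}
            pr = {p: 1.0 for p in pages}          # start every page at PR = 1
            for _ in range(iterations):
                new_pr = {}
                for page in pages:
                    # sum of PR(T)/C(T) over all pages T that link to this page
                    incoming = sum(pr[t] / len(links[t])
                                   for t in links if page in links[t])
                    new_pr[page] = (1 - d) + d * incoming
                pr = new_pr
            return pr

        # Toy graph: T1 and T2 both point to A; T2 also points to T1
        print(pagerank({"T1": ["A"], "T2": ["A", "T1"], "A": []}))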

  • PageRank: worked example
    [Diagram: pages T1-T8, X1 and X2 all point to page A; sample values include PR(T1) = .725, PR(T2) = ... = PR(T7) = 1, PR(T8) = 2.46625, PR(A) = 4.2544375]
    Note: these are not real PageRanks, since they include values >= 1

  • PageRank
    Similar to calculations used in scientific citation analysis (e.g., Garfield et al.) and social network analysis (e.g., Wasserman et al.)
    Similar to other work on ranking (e.g., the hubs and authorities of Kleinberg et al.)
    How is Amazon similar to Google in terms of the basic insights and techniques of PageRank?
    How could PageRank be applied to other problems and domains?

  • Today
    Review: Web Crawling and Search Issues; Web Search Engines and Algorithms
    Web Search Processing
    Parallel Architectures (Inktomi - Eric Brewer)
    Cheshire III Design

    Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

  • Digital Library Grid Initiatives: Cheshire3 and the Grid
    Ray R. Larson, University of California, Berkeley, School of Information Management and Systems
    Rob Sanderson, University of Liverpool, Dept. of Computer Science
    Thanks to Dr. Eric Yen and Prof. Michael Buckland for parts of this presentation
    Presentation from DLF Forum, April 2005

  • Overview
    The Grid, Text Mining and Digital Libraries: Grid Architecture, Grid IR Issues
    Cheshire3: Bringing Search to Grid-Based Digital Libraries: Overview, Grid Experiments, Cheshire3 Architecture, Distributed Workflows

  • Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan)
    [Layer diagram]
    Applications: climate, data grid, remote computing, remote visualization, collaboratories, high energy physics, cosmology, astrophysics, combustion, chemical engineering, portals, remote sensors, ...
    Application Toolkits
    Grid Services (grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
    Grid Fabric: storage, networks, computers, display devices, etc. and their associated local services

  • Grid Architecture (ECAI/AS Grid Digital Library Workshop)
    [Layer diagram, extended with digital-library components]
    Applications: climate, data grid, remote computing, remote visualization, collaboratories, high energy physics, cosmology, astrophysics, combustion, chemical engineering, bio-medical, humanities computing, digital libraries, portals, remote sensors, text mining, metadata management, search & retrieval, ...
    Application Toolkits
    Grid Services (grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
    Grid Fabric: storage, networks, computers, display devices, etc. and their associated local services

  • Grid-Based Digital Libraries
    Large-scale distributed storage requirements and technologies
    Organizing distributed digital collections
    Shared metadata standards and requirements
    Managing distributed digital collections
    Security and access control
    Collection replication and backup
    Distributed information retrieval issues and algorithms

  • Grid IR Issues
    Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e. speed)
    Very large-scale distribution of resources is a challenge for sub-second retrieval
    Different from most other typical Grid processes: IR is potentially less computing-intensive and more data-intensive
    In many ways Grid IR replicates the process (and problems) of metasearch or distributed search

  • Cheshire3 Overview
    XML information retrieval engine
    3rd generation of the UC Berkeley Cheshire system, co-developed at the University of Liverpool
    Uses Python for flexibility and extensibility, but imports C/C++-based libraries for processing speed
    Standards-based: XML, XSLT, CQL, SRW/U, Z39.50, OAI, to name a few
    Grid-capable: uses distributed configuration files, workflow definitions and PVM (currently) to scale from one machine to thousands of parallel nodes
    Free and open source software (GPL licence)
    http://www.cheshire3.org/ (under development!)

  • Cheshire3 Server Overview

  • Cheshire3 Grid Tests
    Running on a 30-processor cluster in Liverpool using PVM (parallel virtual machine)
    Using 16 processors with one master and 22 slave processes, we were able to parse and index MARC data at about 13,000 records per second
    On a similar setup, 610 MB of TEI data can be parsed and indexed in seconds

  • SRB and SDSC Experiments
    We are working with SDSC to include SRB support
    We are planning to continue working with SDSC and to run further evaluations using the TeraGrid server(s), through a small grant for 30,000 CPU hours
    SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes, each with dual 1.5 GHz Intel Itanium 2 processors, for a peak performance of 3.1 teraflops. The nodes are equipped with four gigabytes (GB) of physical memory per node. The cluster is running SuSE Linux and is using Myricom's Myrinet cluster interconnect network.
    Planned large-scale test collections include NSDL, the NARA repository, CiteSeer and the million-book collections of the Internet Archive

  • Cheshire3 Object Model
    [Object model diagram; labels include Record and ProtocolHandler]

  • Cheshire3 Data Objects
    DocumentGroup: a collection of Document objects (e.g. from a file, directory, or external search)
    Document: a single item, in any format (e.g. PDF file, raw XML string, relational table)
    Record: a single item, represented as parsed XML
    Query: a search query, in the form of CQL (an abstract query language for information retrieval)
    ResultSet: an ordered list of pointers to records
    Index: an ordered list of terms extracted from Records

  • Cheshire3 Process Objects
    PreParser: given a Document, transform it into another Document (e.g. PDF to text, text to XML)
    Parser: given a Document as a raw XML string, return a parsed Record for the item
    Transformer: given a Record, transform it into a Document (e.g. via XSLT, from XML to PDF, or XML to relational table)
    Extracter: extract terms of a given type from an XML sub-tree (e.g. extract dates, keywords, exact string values)
    Normaliser: given the results of an Extracter, transform the terms, maintaining the data structure (e.g. CaseNormaliser)

  • Cheshire3 Abstract Objects
    Server: a logical collection of databases
    Database: a logical collection of Documents, their Record representations and Indexes of extracted terms
    Workflow: a 'meta-process' object that takes a workflow definition in XML and converts it into executable code

  • Workflow Objects
    Workflows are first-class objects in Cheshire3 (though not represented in the model diagram)
    All Process and Abstract objects have individual XML configurations with a common base schema with extensions
    We can treat configurations as Records and store them in regular RecordStores, allowing access via regular IR protocols

  • Workflow References
    Workflows contain a series of instructions to perform, with references to other Cheshire3 objects
    Reference is via pseudo-unique identifiers: pseudo because they are unique within the current context (Server vs Database)
    Workflows are objects, so this enables server-level workflows to call database-specific workflows with the same identifier

  • Distributed Processing
    Each node in the cluster instantiates the configured architecture, potentially through a single ConfigStore
    Master nodes then run a high-level workflow to distribute the processing amongst Slave nodes by reference to a subsidiary workflow
    As object interaction is well defined in the model, the result of a workflow is equally well defined. This allows for the easy chaining of workflows, either locally or spread throughout the cluster.

  • Workflow Example 1
    [XML configuration listing for a workflow.SimpleWorkflow object; includes the message "Starting Load"]

  • Workflow Example 2
    [XML configuration listing for a workflow.SimpleWorkflow object; includes the messages "Unparsable Record" and "Loaded Record"]

  • Workflow Standards
    Cheshire3 workflows do not conform to any standard schema
    This is intentional: workflows are specific to and dependent on the Cheshire3 architecture
    Replaces the distribution of lines of code for distributed processing
    Replaces many lines of code in general
    Needs to be easy to understand and create
    GUI workflow builder coming (web and standalone)

  • External Integration
    Looking at integration with existing cross-service workflow systems, in particular Kepler/Ptolemy
    Possible integration at two levels:
    Cheshire3 as a service (black box): identify a workflow to call
    Cheshire3 object as a service (duplicate existing workflow function), but recall the access-speed issue

  • Conclusions
    Scalable Grid-based digital library services can be created and provide support for very large collections with improved efficiency
    The Cheshire3 IR and DL architecture can provide Grid (or single-processor) services for next-generation DLs
    Available as open source via: http://cheshire3.sourceforge.net or http://www.cheshire3.org/

  • Plan for today
    Wrap up spam
    Crawling
    Connectivity servers

  • Link-based ranking
    Most search engines use hyperlink information for ranking
    Basic idea: peer endorsement; web page authors endorse their peers by linking to them
    Prototypical link-based ranking algorithm: PageRank
    A page is important if linked to (endorsed) by many other pages; more so if the linking pages are themselves important
    More later

  • Link spam
    Link spam: inflating the rank of a page by creating nepotistic links to it, from own sites (link farms), partner sites (link exchanges), or unaffiliated sites (e.g. blogs, web forums, etc.)
    The more links, the better: generate links automatically, use scripts to post to blogs, synthesize entire web sites (often an infinite number of pages), synthesize many web sites (DNS spam; e.g. *.thrillingpage.info)
    The more important the linking page, the better: buy expired highly-ranked domains, post to high-quality blogs

  • Link farms and link exchanges

  • More spam techniques: Cloaking
    Serve fake content to the search engine spider
    DNS cloaking: switch IP address; impersonate
    [Diagram illustrating cloaking]

  • Tutorial on Cloaking & Stealth Technology

  • More spam techniques
    Doorway pages: pages optimized for a single keyword that redirect to the real target page
    Robots: fake query stream / rank-checking programs; curve-fit ranking programs of search engines; millions of submissions via Add-Url

  • Acid test
    Which SEOs rank highly on the query "seo"?
    Web search engines have policies on SEO practices they tolerate/block; see pointers in Resources
    Adversarial IR: the unending (technical) battle between SEOs and web search engines
    See for instance http://airweb.cse.lehigh.edu/

  • Crawling

  • Crawling Issues
    How to crawl?
    Quality: best pages first
    Efficiency: avoid duplication (or near-duplication)
    Etiquette: robots.txt, server load concerns

    How much to crawl? How much to index?
    Coverage: how big is the Web? How much do we cover?
    Relative coverage: how much do competitors have?

    How often to crawl?
    Freshness: how much has changed? How much has really changed? (Why is this a different question?)

  • Basic crawler operation
    Begin with known seed pages
    Fetch and parse them
    Extract the URLs they point to
    Place the extracted URLs on a queue
    Fetch each URL on the queue and repeat
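    A minimal single-threaded Python sketch of that loop (illustrative only; real crawlers are distributed and must also handle robots.txt, politeness and duplicate content):

        from collections import deque
        from urllib.parse import urljoin
        import re
        import urllib.request

        def fetch(url):
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read().decode("utf-8", errors="replace")

        def crawl(seeds, max_pages=100):
            queue, seen = deque(seeds), set(seeds)
            while queue and len(seen) < max_pages:
                url = queue.popleft()
                html = fetch(url)                                    # fetch and parse
                for link in re.findall(r'href="([^"]+)"', html):     # extract URLs
                    absolute = urljoin(url, link)
                    if absolute not in seen:                         # avoid duplicates
                        seen.add(absolute)
                        queue.append(absolute)                       # place on the queue
            return seen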

  • Simple picture complications
    Web crawling isn't feasible with one machine: all of the above steps must be distributed
    Even non-malicious pages pose challenges: latency/bandwidth to remote servers vary; robots.txt stipulations; how deep should you crawl a site's URL hierarchy?; site mirrors and duplicate pages
    Malicious pages: spam pages (Lecture 1, plus others to be discussed); spider traps, including dynamically generated ones
    Politeness: don't hit a server too often

  • Robots.txt
    Protocol for giving spiders ("robots") limited access to a website, originally from 1994: www.robotstxt.org/wc/norobots.html
    The website announces its request on what can(not) be crawled
    For a URL, create a file URL/robots.txt
    This file specifies access restrictions

  • Robots.txt example
    No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:
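    Python's standard library can evaluate such a policy; a small check of the example above (the example.com URLs are placeholders):

        from urllib.robotparser import RobotFileParser

        rp = RobotFileParser()
        rp.parse([
            "User-agent: searchengine",
            "Disallow:",
            "",
            "User-agent: *",
            "Disallow: /yoursite/temp/",
        ])

        print(rp.can_fetch("*", "http://example.com/yoursite/temp/x.html"))             # False
        print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/x.html"))  # True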

  • Crawling and Corpus Construction
    Crawl order
    Distributed crawling
    Filtering duplicates
    Mirror detection

  • Where do we spider next?

  • Crawl Order
    Want best pages first
    Potential quality measures: final in-degree, final PageRank
    Crawl heuristics: breadth-first search (BFS), partial in-degree, partial PageRank, random walk

  • BFS & Spam (worst-case scenario)
    Assume a normal average out-degree of 10, and that the spammer is able to generate dynamic pages with 1000 outlinks each
    BFS depth = 2: 100 URLs on the queue, including a spam page
    BFS depth = 3: 2000 URLs on the queue, 50% belong to the spammer
    BFS depth = 4: 1.01 million URLs on the queue, 99% belong to the spammer
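    The numbers follow directly from the stated out-degrees; a quick check in Python:

        # 99 normal URLs (out-degree 10) and 1 spam URL (out-degree 1000) at depth 2
        normal, spam = 99, 1
        normal, spam = normal * 10, spam * 1000   # depth 3: 990 + 1000 = 1990 (~2000), ~50% spam
        print(normal + spam, spam / (normal + spam))
        normal, spam = normal * 10, spam * 1000   # depth 4: 9900 + 1000000 (~1.01M), ~99% spam
        print(normal + spam, spam / (normal + spam))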

  • Where do we spider next?
    Keep all spiders busy
    Keep spiders from treading on each other's toes
    Avoid fetching duplicates repeatedly
    Respect politeness/robots.txt
    Avoid getting stuck in traps
    Detect/minimize spam
    Get the best pages
    What's best? Best for answering search queries

  • Where do we spider next?
    A complex scheduling optimization problem, subject to all the constraints listed, plus operational constraints (e.g., keeping all machines load-balanced)
    Scientific study limited to specific aspects: Which ones? What do we measure? What are the compromises in distributed crawling?

  • Parallel Crawlers
    We follow the treatment of Cho and Garcia-Molina: http://www2002.org/CDROM/refereed/108/index.html
    Raises a number of questions in a clean setting, for further study
    Setting: we have a number of c-procs (c-proc = crawling process)
    Goal: we wish to spider the best pages with minimum overhead
    What do these mean?

  • Distributed model
    Crawlers may be running in diverse geographies: Europe, Asia, etc.
    Periodically update a master index; incremental update, so this is cheap (compression, differential update, etc.)
    Focus on communication overhead during the crawl
    Also results in dispersed WAN load

  • c-procs crawling the web
    [Diagram: URLs crawled, URLs in queues; communication is by URLs passed between c-procs]

  • Measurements
    Overlap = (N - I)/I, where N = number of pages fetched and I = number of distinct pages fetched
    Coverage = I/U, where U = total number of web pages
    Quality = sum over downloaded pages of their importance (importance of a page = its in-degree)
    Communication overhead = number of URLs the c-procs exchange
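    A small Python sketch computing these metrics for a hypothetical crawl log (the toy data and the in-degree table are illustrative assumptions):

        def crawl_metrics(fetched, total_web_pages, indegree, exchanged_urls):
            N = len(fetched)                 # pages fetched, counting repeats
            distinct = set(fetched)
            I = len(distinct)                # distinct pages fetched
            overlap = (N - I) / I
            coverage = I / total_web_pages
            quality = sum(indegree.get(p, 0) for p in distinct)   # importance = in-degree
            return overlap, coverage, quality, exchanged_urls

        print(crawl_metrics(["a", "b", "b", "c"], total_web_pages=10,
                            indegree={"a": 5, "b": 2, "c": 1}, exchanged_urls=7))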

  • Crawler variations
    c-procs are independent: fetch pages oblivious to each other
    Static assignment: web pages partitioned statically a priori, e.g., by URL hash (more to follow)
    Dynamic assignment: a central coordinator splits URLs among c-procs

  • Static assignment
    Firewall mode: each c-proc only fetches URLs within its partition (typically a domain); inter-partition links are not followed
    Crossover mode: a c-proc may follow inter-partition links into another partition; possibility of duplicate fetching
    Exchange mode: c-procs periodically exchange URLs they discover in another partition
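    A minimal sketch of the static partitioning itself, hashing on the host so a site's pages land with one c-proc (the partition count and hash choice are illustrative assumptions):

        import hashlib
        from urllib.parse import urlparse

        def assign_partition(url, num_cprocs=4):
            host = urlparse(url).netloc
            digest = hashlib.md5(host.encode("utf-8")).hexdigest()
            return int(digest, 16) % num_cprocs    # index of the responsible c-proc

        print(assign_partition("http://www.stanford.edu/biology"))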

  • Experiments
    40M URL graph: Stanford WebBase
    Open Directory (dmoz.org) URLs as seeds
    Should be considered a small Web

  • Summary of findings
    Cho/Garcia-Molina detail many findings
    We will review some here, both qualitatively and quantitatively
    You are expected to understand the reason behind each qualitative finding in the paper
    You are not expected to remember quantities in their plots/studies

  • Firewall mode coverage
    [Plot] The price of crawling in firewall mode

  • Crossover mode overlap
    [Plot] Demanding coverage drives up overlap

  • Exchange mode communication
    [Plot] Communication overhead is sub-linear per downloaded URL

  • Connectivity servers

  • Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]
    Support for fast queries on the web graph: Which URLs point to a given URL? Which URLs does a given URL point to?
    Stores mappings in memory from URL to outlinks, URL to inlinks
    Applications: crawl control, web graph analysis (connectivity, crawl optimization), link analysis
    More on this later

  • Most recent published work
    Boldi and Vigna: http://www2004.org/proceedings/docs/1p595.pdf
    WebGraph: a set of algorithms and a Java implementation
    Fundamental goal: maintain node adjacency lists in memory
    For this, compressing the adjacency lists is the critical component

  • Adjacency lists
    The set of neighbors of a node
    Assume each URL is represented by an integer
    Properties exploited in compression:
    Similarity (between lists)
    Locality (many links from a page go to nearby pages)
    Use gap encodings in sorted lists
    Distribution of gap values
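    A toy sketch of gap encoding for one sorted adjacency list (locality keeps the gaps small, so they are cheap to code with a variable-length scheme; the IDs below are made up):

        def gap_encode(sorted_neighbors):
            gaps, prev = [], 0
            for n in sorted_neighbors:
                gaps.append(n - prev)   # store differences instead of absolute IDs
                prev = n
            return gaps

        def gap_decode(gaps):
            neighbors, total = [], 0
            for g in gaps:
                total += g
                neighbors.append(total)
            return neighbors

        print(gap_encode([1000, 1002, 1003, 1050]))   # [1000, 2, 1, 47]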

  • Storage
    Boldi/Vigna get down to an average of ~3 bits/link (URL-to-URL edge) for a 118M-node web graph
    How?

    Why is this remarkable?

  • Main ideas of Boldi/Vigna
    Consider a lexicographically ordered list of all URLs, e.g.:
    www.stanford.edu/alchemy
    www.stanford.edu/biology
    www.stanford.edu/biology/plant
    www.stanford.edu/biology/plant/copyright
    www.stanford.edu/biology/plant/people
    www.stanford.edu/chemistry

  • Boldi/Vigna
    Each of these URLs has an adjacency list
    Main thesis: because of templates, the adjacency list of a node is similar to one of the 7 preceding URLs in the lexicographic ordering
    Express the adjacency list in terms of one of these
    E.g., consider these adjacency lists:
    1, 2, 4, 8, 16, 32, 64
    1, 4, 9, 16, 25, 36, 49, 64
    1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
    1, 4, 8, 16, 25, 36, 49, 64
    Encode the last list as (reference = -2, i.e. the list two back), remove 9, add 8
    Why 7?
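    An illustrative Python sketch of that reference idea (not the actual WebGraph codec): each list is encoded as a back-reference to one of the previous 7 lists plus the integers to remove and add.

        def reference_encode(adj_lists, window=7):
            encoded = []
            for i, current in enumerate(adj_lists):
                cur = set(current)
                best = None
                for back in range(1, min(window, i) + 1):      # candidate reference lists
                    ref = set(adj_lists[i - back])
                    removed, added = sorted(ref - cur), sorted(cur - ref)
                    cost = len(removed) + len(added)
                    if best is None or cost < best[3]:
                        best = (back, removed, added, cost)
                if best is None or best[3] >= len(current):
                    encoded.append((0, [], sorted(cur)))        # cheaper to list it outright
                else:
                    encoded.append(best[:3])
            return encoded

        lists = [[1, 2, 4, 8, 16, 32, 64],
                 [1, 4, 9, 16, 25, 36, 49, 64],
                 [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144],
                 [1, 4, 8, 16, 25, 36, 49, 64]]
        print(reference_encode(lists)[3])   # (2, [9], [8]): refer 2 back, remove 9, add 8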

  • Resources
    www.robotstxt.org/wc/norobots.html
    www2002.org/CDROM/refereed/108/index.html
    www2004.org/proceedings/docs/1p595.pdf

  • Example of DNS spam: *.thrillingpage.info