

  • SEARCH ENGINE INSIDE OUT r86526020 r88526016 r88526028 b85506013 b85506010

    April 11, 2000. From Technical Views.

  • Outline

    Why Search Engines are so important
    Search Engine Architecture
    Crawling Subsystem
    Indexing Subsystem
    Search Interface
    Future Trends
    Discussion

  • Statistics

    1 in every 28 page views on the Web is a search result page. (June 1, 1999, Alexa Insider)

    The most widely traveled path on the web in March 1999 was from to . (March 1999, Alexa Insider)

    The average work user spends 73 minutes per month at search engines, second only to the 97 minutes spent at news, info and entertainment sites. (Feb. 1999, Internet World)

    Almost 50% of online users turn to search sites for their online news needs. (Dec. 1998, Jupiter)

  • Statistics (Source: IMP Strategies, Feb. 21, 2000)

  • Statistics: How Many Searches Are Performed (unit: millions/day)

    Inktomi (Jan. 2000)       38
    Google (Apr. 2000)        12
    AskJeeves (Mar. 2000)      4
    Voila (Jan. 2000)          1.2
    Total searches (est.)     94

    Take Inktomi for example: it must accept about 440 queries every second.
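The 440-queries-per-second figure is just Inktomi's daily volume divided by the number of seconds in a day; a quick arithmetic check:

```python
# Back-of-the-envelope check of the slide's figure: 38 million
# queries/day for Inktomi comes out to roughly 440 per second.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def queries_per_second(per_day: float) -> float:
    return per_day / SECONDS_PER_DAY

print(round(queries_per_second(38_000_000)))  # → 440
print(round(queries_per_second(12_000_000)))  # → 139 (Google's 12M/day)
```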

  • Taxonomy

    General-purpose Search Engines: AltaVista, Excite, Infoseek, Lycos, HotBot, ...
    Hierarchical Directories: Yahoo, Open Directory, LookSmart, ...
    Meta Search Engines: MetaCrawler, DogPile, SavvySearch, ...
    Question-Answering: AskJeeves
    Specialized Search Engines: HomePage finders, shopping robots, RealName, ...

  • Architecture

  • Components

    Spider: crawls the web and collects the documents it finds.
    Indexer: processes the documents and builds a logical view of the data.
    Search Interface: accepts user queries and searches the index database; also ranks the result listing and presents it to the user.

  • Crawling Subsystem

    Spider(URL) {
        # Use the HTTP GET method to acquire the web page
        Set HttpConnection = HTTPGet(URL);
        # Verify that the response is valid and not a 404 error
        Set Content = CheckInformation(HttpConnection);
        # Place the information into a database for later processing
        StoreInformation(Content);
    }
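A minimal runnable sketch of the fetch/check/store loop above, with content fingerprinting added for the duplicate detection discussed later; `store` is a hypothetical callback standing in for the database layer:

```python
import hashlib
import urllib.request
from urllib.error import HTTPError, URLError

def fingerprint(content: bytes) -> str:
    """Hash page content so exact (syntactic) duplicates are detected cheaply."""
    return hashlib.md5(content).hexdigest()

def spider(url, store, seen_fingerprints):
    """Fetch one page, skip errors and exact duplicates, then store it."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            content = resp.read()
    except (HTTPError, URLError):
        return None  # 404s, timeouts, DNS failures: just move on
    fp = fingerprint(content)
    if fp in seen_fingerprints:
        return None  # already crawled an identical copy
    seen_fingerprints.add(fp)
    store(url, content)
    return fp
```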

  • [Bar chart: Measurement of Indexed Pages (unit: millions) for FAST, AltaVista, Excite, Northern Light, Google, Inktomi, Go, and Lycos; reported values range from about 50 to 300 million pages. Report date: Feb. 3, 2000]

  • [Bar chart: Coverage of the Web (est. 1 billion total pages) for FAST, AltaVista, Excite, Northern Light, Google, Inktomi, Go, and Lycos; coverage ranges from about 6% to 38%. Report date: Feb. 3, 2000]

  • Issues for Crawling (1/3)

    Web Exploration with Priority
    - Decide which site (page) is explored first
    - Ensure document quality and coverage
    - Use random, BFS, or DFS exploration (with depth limits) plus priorities

    Duplications
    - Host-wise duplications: nearly 30% of the web is syntactically duplicated; ?? are semantically duplicated
    - Single-host duplications: the same website under different host names; symbolic links can create infinite routes in the web graph
    - Use fingerprints and limited-depth exploration

    Dynamic Documents
    - Retrieve dynamic documents or not?
    - A single dynamic document may appear under different parameters
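The depth-limited BFS exploration mentioned above can be sketched over an in-memory link graph; the visited set plus the depth limit is what keeps cycles (such as the infinite routes created by symbolic links) from trapping the crawler:

```python
from collections import deque

def crawl_order(graph, seed, max_depth):
    """Return the BFS visit order from seed, never going deeper than
    max_depth and never revisiting a URL (so link cycles terminate)."""
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not expand further
        for nxt in graph.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

# A tiny link graph containing the cycle a -> b -> a:
links = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["e"]}
print(crawl_order(links, "a", 2))  # → ['a', 'b', 'c', 'd']
```

With `max_depth=2` the page "e" is never reached; raising the limit to 3 includes it.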

  • Issues for Crawling (2/3)

    Load Balance
    - Internal: response times and answer sizes are unpredictable; there are additional system constraints (# threads, # open connections, etc.)
    - External: never overload websites or network links (a well-connected crawler can saturate the entire outside bandwidth of a small country); support the robot exclusion standard for politeness

    Storage Management
    - Huge amount of URL/document data

    Freshness
    - Many websites (pages) change often; others remain nearly unchanged
    - Revisit different websites with different periods
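The robot exclusion standard mentioned above is directly supported by Python's standard library; here the robots.txt body is parsed from a list of lines (in practice it would first be fetched from `http://<host>/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt body; a polite crawler checks it before fetching.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /cgi-bin/",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # → True
print(rp.can_fetch("MyCrawler", "http://example.com/private/a"))   # → False
```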

  • Issues for Crawling (3/3)

    The Hidden Web
    - Some websites are not popular but valuable
    - Use fast DNS search for possible explorations

    Sample Architecture of a Crawling System (adapted from a topic-specific crawler)

  • Indexer Subsystem

    Index(Content, URL) {
        # Extract each needed HTML structure
        Set Head = GetHtmlHead(Content);
        Set Title = GetHtmlTitle(Content);
        Set Keywords = GetHtmlKeywords(Content);
        # Store each keyword with its internal representation
        Loop over Keyword in Keywords {
            Set Object = CreateObject(Keyword, Title, Head, URL);
            StoreKeyword(Object, Keyword);
        }
    }

  • [Diagram: classic retrieval process. The user interface takes the user need; text operations build the logical view of both the text and the query; query operations produce the final query; searching runs it against the inverted file built over the text database; retrieved docs are ranked and returned, with user feedback refining the query.]

  • Logical View of Docs and Queries from the Vector Space Model

    Documents and queries are treated as t-dimensional vectors, where t is the dimension of the whole index term space.
    Each vector component is the relevance weight for a specific index term.
    Typical measurement for relevance: cosine similarity between the document and query vectors.

  • Logical View of Docs and Queries from the Vector Space Model

    Typical weighting scheme: TFxIDF
        w(i,j) = f(i,j) x log(N / n(i))
        f(i,j): frequency of term i in document j
        N: total number of documents
        n(i): number of different documents in which term i occurs

    Typical effectiveness measurements: Recall and Precision
        Recall = the fraction of the relevant documents which has been retrieved
        Precision = the fraction of the retrieved documents which is relevant
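A minimal sketch of the TFxIDF weighting and the recall/precision measures defined above (documents are simply lists of terms here):

```python
import math

def tfidf(docs):
    """w(term, doc) = f(term, doc) * log(N / n(term)), the TFxIDF scheme:
    N documents in total, n(term) documents containing the term."""
    N = len(docs)
    df = {}  # document frequency n(term)
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weights.append({t: f * math.log(N / df[t]) for t, f in tf.items()})
    return weights

def recall_precision(retrieved, relevant):
    """Recall = relevant retrieved / all relevant;
    Precision = relevant retrieved / all retrieved."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant), hits / len(retrieved)

docs = [["search", "engine", "search"], ["engine", "index"], ["web"]]
w = tfidf(docs)
print(round(w[0]["search"], 2))  # "search": twice in doc 0, nowhere else → 2.2
```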

  • Inverted Index

    [Diagram: example inverted index mapping terms such as "creation" and "search" to the documents (Document ID 1, Document ID 2) in which they occur.]
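The structure illustrated above can be sketched as a dictionary from each term to the list of document IDs containing it:

```python
def build_inverted_index(docs):
    """Map each term to a sorted list of the document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "web search engine", 2: "index creation for search"}
index = build_inverted_index(docs)
print(index["search"])    # → [1, 2]
print(index["creation"])  # → [2]
```

Answering a query then reduces to looking up each query term and intersecting (or merging) the resulting ID lists.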

  • [Diagram: suffix-based index structures built over a text: Suffix Trie, Suffix Tree, Suffix Array, Supra-Index]

  • [Diagram: signature file example over the text "This is a text. A text has many words. Words are made from letters." The text file is split into logical blocks; each block gets an F-bit signature in the signature file, linked back to the text by a pointer file.]

    Parameters
    - D: number of logical blocks
    - F: signature size in bits
    - m: number of bits per word
    - Fd: false drop probability
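A minimal sketch of the signature file idea (superimposed coding): each word sets m pseudo-random bits in an F-bit signature, block signatures are the OR of their word signatures, and a query word "maybe" matches a block when all its bits are set there. A missing bit rules the block out for certain; a full match can still be a false drop, with probability Fd governed by F and m. The hashing scheme here is an illustrative choice, not from the slides:

```python
import hashlib

F = 64  # signature size in bits (the slide's F)
M = 3   # bits set per word (the slide's m)

def word_signature(word):
    """Set M pseudo-random bit positions out of F for one word."""
    sig = 0
    for k in range(M):
        h = hashlib.md5(f"{word}:{k}".encode()).digest()
        sig |= 1 << (int.from_bytes(h[:4], "big") % F)
    return sig

def block_signature(words):
    """Superimpose (bitwise OR) the word signatures of a logical block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(block_sig, word):
    """All of the word's bits set: possibly present (false drops happen);
    any bit missing: definitely absent."""
    ws = word_signature(word)
    return block_sig & ws == ws

block = block_signature(["text", "has", "many", "words"])
print(maybe_contains(block, "text"))  # → True
```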

  • Issues for Indexing (1/2)

    Language Identification
    - Documents in different languages should be unified into a meta-representation
    - Code conversion without concept loss
    - How to identify the language: use metadata (charset, content-encoding) if available; otherwise use statistical approaches

    Storage Management
    - The huge number of indexes cannot be loaded into memory entirely
    - Use cache mechanisms and fast secondary storage access
    - Efficient database structures
    - Use compression?! A speed/storage tradeoff

  • Issues for Indexing (2/2)

    Text Operations
    - Full text or controlled vocabulary
    - Stop lists, stemming, phrase-level indexing, thesauri
    - Concept discovery, directory establishment, categorization
    - Support search with fault tolerance?!

    Query-independent Ranking
    - Weighting scheme for query-independent ranking
    - Web graph representation and manipulation

    Structure Information Preservation
    - Document author, creation time, title, keywords, ...

  • Search Subsystem

    Report(Query) {
        # Get all relevant URLs in the internal database
        Set Candidates = GetRelevantDocuments(Query);
        # Rank the list according to its relevance scores
        Set Answer = Rank(Candidates);
        # Format and display the result
        DisplayResults(Answer);
    }

  • What Makes Web Users So Different

    Poor queries
    - Short queries (2.35 terms for English, 3.4 characters for Chinese)
    - Imprecise terms
    - Sub-optimal syntax (80% of queries use no operator)

    Wide variance in
    - Needs (some are looking for a proper noun only)
    - Expectations
    - Knowledge
    - Bandwidth

    Specific behavior
    - 85% look at one result screen only
    - 78% of queries are not modified
    - Follow links

  • Ranking

    Goal: order the answer set to a query in decreasing order of value.

    Types
    - Query-independent: assign an intrinsic value to a document, regardless of the actual query
    - Query-dependent: value is determined only with respect to a particular query
    - Mixed: a combination of both valuations

    Examples
    - Query-independent: length, vocabulary, publication data, number of citations (indegree), etc.
    - Query-dependent: cosine measurement

  • Some Ranking Criteria

    Content-based techniques
    - Variants of the term vector model or probabilistic model

    Ad-hoc factors
    - Anti-porn heuristics, publication/location data
    - Human annotations

    Connectivity-based techniques
    - Query-independent: PageRank [PBMW 98, BP 98], indegree [CK97]
    - Query-dependent: HITS [K98]

  • Connectivity-Based Ranking: PageRank

    Consider a random Web surfer:
    - Jumps to a random page with probability ε
    - With probability 1 - ε, follows a random hyperlink

    The transition probability matrix is ε x U + (1 - ε) x A, where U is the uniform distribution and A is the (normalized) adjacency matrix.

    Query-independent rank = the stationary probability of this Markov chain:
        PR(a) = ε/N + (1 - ε) x Σ PR(Pi) / C(Pi)
    summing over the pages Pi that link to a, where N is the total number of pages and C(Pi) is the number of outlinks of Pi.

    Crawling the Web using this ordering has been shown to be better than other crawling schemes.
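The stationary distribution above can be computed by power iteration; a minimal sketch over a dictionary-based link graph (assuming every page appears as a key and has at least one outlink, so dangling pages are not handled):

```python
def pagerank(links, eps=0.15, iterations=50):
    """Power iteration for PR(a) = eps/N + (1-eps) * sum PR(p)/C(p),
    summing over pages p that link to a (C(p) = outdegree of p)."""
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}  # start from the uniform distribution
    for _ in range(iterations):
        nxt = {p: eps / N for p in pages}       # random-jump term
        for p, outs in links.items():
            share = pr[p] / len(outs)           # PR(p)/C(p)
            for q in outs:
                nxt[q] += (1 - eps) * share     # follow-a-link term
        pr = nxt
    return pr

# b and c both link to a; a links back to b only.
pr = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
print(max(pr, key=pr.get))  # → a (two in-links, so it ranks highest)
```

Since every page keeps some outlinks, the total probability mass stays 1 across iterations, as a stationary distribution requires.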

  • Practical Systems - AltaVista

    AltaVista configuration, 1998:

    Crawler (Scooter)
    - 1.5 GB memory
    - 30 GB RAID disk
    - 4x533 MHz AlphaServer
    - 1 GB/s I/O bandwidth

    Indexing Engine (Vista)
    - 2 GB memory
    - 180 GB RAID disk
    - 2x533 MHz AlphaServer

    Search Engine (AltaVista)
    - 20 multi-processor machines
    - 130 GB memory
    - 500 GB RAID disk

  • Practical Systems - GoogleThe power of PageRank

  • Future Trends

    Multi-lingual / Cross-lingual Information Retrieval
    - Another path toward concept-oriented searching

    Web Mining
    - Web content mining: customer behavior analysis, advertisement
    - Web usage mining: web query log analysis

    Personalized Search Agents
    - Information filtering, information routing
    - More accurate user concept hierarchy mapping
    - Topic-specific knowledge base creation

    Question-Answering Systems

    Intelligent e-Services
    - User modeling research