Web Search Engines and
Information Retrieval on the World-Wide Web
Torsten Suel
CIS Department
[email protected]
http://cis.poly.edu/suel
Overview:
• introduction and motivation
• research: improving cluster-based search engines
• research: future peer-to-peer search engine architectures
Web search engines:
1. Introduction and Motivation
Basic structure of a search engine:
[Figure: the crawler stores pages on disks; indexing builds the index; a query (“computer”) sent to Search.com is looked up in the index]
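The index and look-up steps of this structure can be illustrated with a toy inverted index. This is a minimal sketch, not the design of any real engine: the `pages` dictionary stands in for crawled pages on disk, and `build_index` and `look_up` are hypothetical names chosen for this example.

```python
from collections import defaultdict

def build_index(pages):
    """Map each term to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

def look_up(index, query):
    """Return ids of pages that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical crawled pages (stand-in for what the crawler wrote to disk).
pages = {
    1: "computer science department",
    2: "used computer parts",
    3: "science museum",
}
index = build_index(pages)
hits = look_up(index, "computer")
```

Here `hits` holds the ids of the pages containing "computer". A real engine would also rank the hits, which this sketch leaves out.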
1. Introduction and Motivation (cont.)
Challenges for search engines:
• coverage (need to cover large part of the web)
  - need to crawl and store massive data sets
• good ranking (in the case of broad queries)
  - smart information retrieval techniques
• freshness (need to update content)
  - frequent recrawling of content
• user load (up to 10000 queries/sec - Google)
  - many queries on massive data
• manipulation (sites want to be listed first)
  - most techniques will be exploited quickly
1. Introduction and Motivation (cont.)
• more than 3 billion web pages and 10 million web sites
• need to crawl, store, and process terabytes of data
• 10000 queries / second (Google)
• cluster of more than 5000 Linux servers (Google)
• “planetary-scale web service”
(google, hotmail, yahoo, aol web caches, akamai)
• proprietary code and secret recipes
1. Introduction and Motivation (cont.)
Other types of web search tools
• Web directories (yahoo, open directory project)
• Specialized search engines (cora, citeseer, achoo, findlaw)
• Local search engines (for one site)
• Meta search engines (dogpile, mamma, search.com)
• Personal search assistants (alexa, google toolbar)
• Image search (ditto, visoo)
• Database search (completeplanet, brightplanet)
1. Introduction and Motivation (cont.)
Data collection, extraction & mining tools
• trademark and copyright enforcement
  - track down mp3 and video files
  - track down images with logos (Cobion)
• comparison shopping and auction bots
• competitive intelligence
• national security: monitoring certain websites
• Example: Whizbang job database:
  - collects job announcements on company web sites
  - focused crawling to track down job announcements
  - sorts job announcements by type, locations, etc.
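Focused crawling can be sketched as a best-first search that follows links from relevant pages before other links. The example below is an illustration under stated assumptions, not the Whizbang system: the in-memory `WEB` graph, the `KEYWORDS` set, and the keyword-count `relevance` function are all hypothetical, and a real crawler would fetch pages over HTTP and use a trained classifier.

```python
import heapq

# Hypothetical in-memory "web": url -> (page text, outgoing links).
WEB = {
    "a.com": ("company home page", ["a.com/jobs", "a.com/news"]),
    "a.com/jobs": ("job openings careers", ["a.com/jobs/1"]),
    "a.com/news": ("press releases", ["a.com/news/1"]),
    "a.com/jobs/1": ("job announcement software engineer", []),
    "a.com/news/1": ("quarterly results", []),
}
KEYWORDS = {"job", "jobs", "careers"}

def relevance(text):
    """Count keyword occurrences; a stand-in for a real classifier."""
    return sum(word in KEYWORDS for word in text.split())

def focused_crawl(seed, budget):
    """Best-first crawl: links found on relevant pages are fetched first."""
    frontier = [(0, seed)]   # (negated priority, url); heapq is a min-heap
    seen = {seed}
    order = []
    while frontier and len(order) < budget:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        order.append(url)
        score = relevance(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return order

order = focused_crawl("a.com", budget=4)
```

With a budget of four pages, the crawl reaches the job announcement at `a.com/jobs/1` before it visits `a.com/news`, and never spends a fetch on `a.com/news/1`.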
1. Introduction and Motivation (cont.)
[Figure: algorithms, systems, information retrieval, databases, machine learning, natural language processing, AI]
1. Introduction and Motivation (cont.)
• efficiency and scaling with query load
  - per-node performance
  - scaling cluster size
• data size and scaling with the web
  - data acquisition: crawling and refresh
  - index size and performance
  - index updates
• better ranking for improved results
  - link-based ranking
  - topic- and context-specific ranking
2. Cluster-Based Search Engines
Research Challenges:
Polybot crawler: (with Vlad Shkapenyuk)
• scalable web crawler
• runs on cluster of servers
• 300 pages/sec (and beyond)
Storage and Indexing: (Alex Okulov and Xiaohui Long)
• storing and indexing terabytes on network of workstations
• fast compression techniques for storage
• index performance and index updates
• index partitioning
[Figure labels: high-speed LAN or SAN; Linux servers with several disks each]
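One widely used way to keep an index compact is to store the gaps between sorted document ids and encode each gap with a variable number of bytes. The sketch below shows that general idea only; the slides do not say which compression techniques the project uses, and all function names here are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte; the high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

def compress_postings(doc_ids):
    """Store gaps between sorted doc ids instead of the ids themselves."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    """Rebuild the doc ids by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids

postings = [3, 7, 21, 150, 100000]
packed = compress_postings(postings)
```

Small gaps fit in a single byte, so long posting lists for frequent terms shrink the most. The five ids above take 8 bytes instead of 20 bytes as 32-bit integers.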
Link-based ranking: (Yenyu Chen and Qingqing Gan)
• Pagerank (Brin&Page/Google)
  “significance of a page depends on significance of those referencing it”
• improving link-based ranking
• integration of term- and link-based methods
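The Pagerank idea quoted on this slide can be sketched as a simple power iteration over the link graph. This is a textbook-style illustration, not the lab's or Google's implementation: the damping factor of 0.85, the iteration count, and the four-page `links` graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> score."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base score regardless of its in-links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # A page passes its score on in equal shares to the pages it references.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its score evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph; every link target must also be a key.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
rank = pagerank(links)
```

Page C is referenced by three pages and ends up with the highest score, while D, which nothing references, keeps only the base score.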
Future Search Engines and Search Tools
• expect powerful user interfaces beyond browser
  - browsing assistants
  - search and navigation tools
• many more search engine accesses
• most access programmatic in nature
• idea: split search engine into upper and lower tier
  - lower tier: crawling, indexing, index queries (dumb, big data)
  - upper tier: ranking, interface, analysis (smart stuff)
• idea: lower layer as highly distributed substrate to support search and navigation tools
  - open and agnostic “let a thousand flowers bloom”
  - scalable “let a million queries fly”
3. Peer-to-peer Search Engine Architectures
P2P web search architecture:
• thousands of powerful machines all over the internet
• machines can join or leave
• agnostic: can implement many IR methods on top
[Figure: four boxes labeled “search engine”]
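One way to cope with machines joining or leaving is consistent hashing, where each index term is assigned to the first machine clockwise from it on a hash ring. The slides do not name a placement scheme, so this is purely an assumed illustration; the `Ring` class, the node names, and the use of SHA-1 are all hypothetical choices.

```python
import bisect
import hashlib

def ring_hash(key):
    """Stable hash of a string onto a 64-bit ring position."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes=()):
        self._points = []   # sorted list of (position, node)
        for node in nodes:
            self.join(node)

    def join(self, node):
        bisect.insort(self._points, (ring_hash(node), node))

    def leave(self, node):
        self._points.remove((ring_hash(node), node))

    def owner(self, term):
        """First node clockwise from the term's position on the ring."""
        i = bisect.bisect_right(self._points, (ring_hash(term), ""))
        return self._points[i % len(self._points)][1]

ring = Ring(["node-a", "node-b", "node-c"])
terms = ["computer", "science", "web", "search", "engine", "crawler"]
before = {t: ring.owner(t) for t in terms}
ring.leave("node-b")
after = {t: ring.owner(t) for t in terms}
moved = [t for t in terms if before[t] != after[t]]
```

When `node-b` leaves, only the terms it owned are reassigned; terms held by the other machines stay where they are, which keeps churn cheap.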
Web Exploration and Search Technology Lab:
• about 10 grad and undergrad students
• more information: http://cis.poly.edu/westlab
• courses on web search, IR, web protocols
Showcase slides at http://cis.poly.edu/showcase/