Web Search
Dr. Yingwu Zhu
Overview
• History
• Search Engine Architecture
• Web Spam
Search Engine Early History
• By the late 1980s, many files were available by anonymous FTP.
• In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”)
  – Assembled lists of files available on many FTP servers.
  – Allowed regex search of these file names.
• In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.
Web Search History
• In 1993, early web robots (spiders) were built to collect URLs:
  – Wanderer
  – ALIWEB (Archie-Like Index of the WEB)
  – WWW Worm (indexed URLs and titles for regex search)
• In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.
Web Search History (cont)
• In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash. (eventually became part of Excite and AOL).
• A few months later, Michael “Fuzzy” Mauldin, a grad student at CMU, developed Lycos. First to use a standard IR system as developed for the DARPA Tipster project. First to index a large set of pages.
• In late 1995, DEC developed AltaVista. Used a large farm of Alpha machines to quickly process large numbers of queries. Supported boolean operators, phrases, and “reverse pointer” queries.
Web Search Recent History
• In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google. Main advance is the use of link analysis to rank results partially based on authority.
  – PageRank
History
1. FTP
2. FTP + Search
3. Web crawlers: URLs
4. Yahoo
5. Indexing + Search
6. Ranking: Google
Search Landscape 2005
• Four major “mainframes”
  – Google, Yahoo, MSN, and Ask
• >450M searches daily
  – 60% international
  – Thousands of machines
• $8+B in paid search revenues
• Large indices
  – Billions of documents
  – Terabytes of data
• Excellent relevance
  – For some tasks
Source: Search Engine Watch
Overview
• History
• Search Engine Architecture
• Web Spam
Characteristics of Web search
• Huge amounts of text to search through
• Pages are linked
• Pages differ greatly in quality
• A single search may return many pages
  – A user will not look at all result pages
  – Result pages need to be ranked
  – Complete result set may be unnecessary
Slide adapted from Lew & Davis
How Search Engines Work
Do you know how it works? What is the architecture?
Slide adapted from Lew & Davis
How Search Engines Work
Three main parts:
1. Gather the contents of all web pages (using a program called a crawler or spider)
2. Organize the contents of the pages in a way that allows efficient retrieval (indexing)
3. Take in a query, determine which pages match, and show the results (ranking and display of results)
Standard Web Search Engine Architecture
[Figure: crawler machines crawl the web; pages are checked for duplicates and stored; an indexer creates an inverted index over the stored documents (DocIDs); search engine servers use the inverted index to answer the user query and show results to the user.]
Search Engine Architecture
[Figure: the crawler takes a snapshot of the WWW; the indexer builds the web index, web map, and metadata; query serving answers queries against the web index. Key concerns: comprehensiveness and freshness, ranking and presentation.]
Comprehensiveness
• Problem:
  – Make accessible all useful Web pages
• Issues:
  – Web has an infinite number of pages
  – Finite resources available
    • Bandwidth
    • Disk capacity
• Selection problem
  – Which pages to visit → Crawl policy
  – Which pages to index → Index selection policy
Freshness
• Problem:
  – Ensure that what is indexed correctly reflects the current state of the web
• Impossible to achieve exactly
  – Revisit vs Discovery
• Divide and conquer
  – A few pages change continually
  – Most pages are relatively static
Ranking
• Problem:
  – Given a well-formed query, place the most relevant pages in the first few positions
• Issues:
  – Scale: many candidate matches
    • Response in < 100 msecs
  – Evaluation:
    • Editorial
    • User behavior
Overview
• History
• Search Engine Architecture
  – Crawler or Spider
  – Indexing
  – Ranking
• Web Spam
Crawler
• How does a crawler work?
• How to design a crawler?
• What needs to be considered in the design?
Web Crawlers
• How do the web search engines get all of the items they index?
• Main idea:
  – Start with known sites
  – Record information for these sites
  – Follow the links from each site
  – Record information found at new sites
  – Repeat
What is a Crawler?
[Figure: crawler loop. Starting from an initial set of URLs, the crawler repeatedly takes the next URL from the to-visit list, gets the page from the web, extracts URLs from it, adds them to the to-visit list, and records the visited URLs and web pages.]
Web Crawling Algorithm
• More precisely:
  – Put a set of known sites on a queue
  – Repeat the following until the queue is empty:
    • Take the first page off of the queue
    • If this page has not yet been processed:
      – Record the information found on this page
        » Positions of words, links going out, etc.
      – Add each link on the current page to the queue
      – Record that this page has been processed
• Rule-of-thumb: 1 doc per minute per crawling server
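As a rough illustration of the loop above, here is a minimal Python sketch (not any engine's actual crawler). The extract_links() helper, the max_pages limit, and the naive regex-based link extraction are illustrative assumptions rather than a robust implementation.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def extract_links(base_url, html):
    # Naive href extraction, for illustration only
    return [urljoin(base_url, href)
            for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)     # put a set of known sites on a queue
    processed = set()            # pages already handled
    pages = {}                   # url -> recorded page contents
    while queue and len(pages) < max_pages:
        url = queue.popleft()    # take the first page off of the queue
        if url in processed:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue             # servers are often down or slow
        pages[url] = html        # record the information found on this page
        for link in extract_links(url, html):
            queue.append(link)   # add each link on the current page to the queue
        processed.add(url)       # record that this page has been processed
    return pages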
Crawl Policy
• Pages found by following links
  – From an initial root set
• Basic iteration:
  – Visit pages and extract links
  – Prioritize next pages to visit (or revisit)
• Framework
  – Visit pages
    • most likely to be viewed
    • most likely to contain links to pages that will be viewed
  – Prioritization by query-independent quality
Slide adapted from Lew & Davis
Crawler behaviour varies
• Parts of a web page that are indexed
• How deeply a site is indexed
• Types of files indexed
• How frequently the site is spidered
The behavior of a web crawler is the outcome of a combination of policies:
• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes to the pages.
• A politeness policy that states how to avoid overloading websites.
• A parallelization policy that states how to coordinate distributed web crawlers.
Four Laws of Crawling
• A crawler must show identification
  – A crawler must identify itself using the User-agent field of an HTTP request
• A crawler must obey the robots exclusion standard: http://www.robotstxt.org/wc/norobots.html
• A crawler must not hog resources
• A crawler must report errors
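A minimal Python sketch of a "polite" fetch that follows the first three laws: it identifies itself, checks robots.txt before downloading, and waits between requests. The USER_AGENT string and the delay are hypothetical values, not part of any standard.

import time
from urllib import robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

USER_AGENT = "ExampleCrawler/1.0 (+http://example.edu/crawler)"  # hypothetical name

def polite_fetch(url, delay_seconds=1.0):
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser(urljoin(root, "/robots.txt"))
    rp.read()                                  # obey the robots exclusion standard
    if not rp.can_fetch(USER_AGENT, url):
        return None                            # page is off-limits for crawlers
    time.sleep(delay_seconds)                  # don't hog the server's resources
    req = Request(url, headers={"User-Agent": USER_AGENT})  # show identification
    return urlopen(req, timeout=10).read()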
Lots of tricky aspects
• Servers are often down or slow
• Hyperlinks can get the crawler into cycles
• Some websites have junk in the web pages
• Now many pages have dynamic content
  – The “hidden” web
  – E.g., schedule.xxx.edu
    • You don’t see the course schedules until you run a query.
• The web is HUGE
Web Crawling Issues
• Keep-out signs
  – A file called robots.txt lists “off-limits” directories
• Freshness: Figure out which pages change often, and recrawl these often.
• Duplicates, virtual hosts, etc.
  – Convert page contents with a hash function
  – Compare new pages to the hash table
• Lots of problems
  – Server unavailable; incorrect HTML; missing links; attempts to “fool” the search engine by giving the crawler a version of the page with lots of spurious terms added ...
• Web crawling is difficult to do robustly!
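A minimal sketch of duplicate detection by hashing page contents, as suggested above. It uses an exact SHA-1 content hash; real engines may also use near-duplicate signatures, which is beyond this sketch.

import hashlib

seen_hashes = {}   # content hash -> first URL seen with that content

def is_duplicate(url, page_bytes):
    digest = hashlib.sha1(page_bytes).hexdigest()  # convert page contents with a hash function
    if digest in seen_hashes:                      # compare new pages to the hash table
        return True
    seen_hashes[digest] = url
    return False

print(is_duplicate("http://a.example/x", b"<html>same body</html>"))   # False
print(is_duplicate("http://b.example/y", b"<html>same body</html>"))   # True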
Crawling order
• Want to visit best pages first.
• Need a measure of quality (in-degree, PageRank).
• Possible orderings:
  – Breadth-first search (FIFO)
  – In-degree (so far)
  – PageRank (so far)
  – Random
• Experiments suggest breadth-first search finds pages with high PageRank early (removes the need for computation).
Overview
• History
• Search Engine Architecture
  – Crawler or Spider
  – Indexing
  – Ranking
• Web Spam
Indexing
• Indexing using IR techniques, producing inverted files for web pages
• Vector space model (VSM)
How Inverted Files Are Created
• Periodically rebuilt, static otherwise.
• Documents are parsed to extract tokens. These are saved with the Document ID.
Doc 1: “Now is the time for all good men to come to the aid of their country”
Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight”

(Term, Doc #) pairs in order of appearance: now:1, is:1, the:1, time:1, for:1, all:1, good:1, men:1, to:1, come:1, to:1, the:1, aid:1, of:1, their:1, country:1, it:2, was:2, a:2, dark:2, and:2, stormy:2, night:2, in:2, the:2, country:2, manor:2, the:2, time:2, was:2, past:2, midnight:2
How Inverted Files are Created
• After all documents have been parsed the inverted file is sorted alphabetically.
Sorted (Term, Doc #) pairs: a:2, aid:1, all:1, and:2, come:1, country:1, country:2, dark:2, for:1, good:1, in:2, is:1, it:2, manor:2, men:1, midnight:2, night:2, now:1, of:1, past:2, stormy:2, the:1, the:1, the:2, the:2, their:1, time:1, time:2, to:1, to:1, was:2, was:2
How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.
Merged (Term, Doc #, Freq) entries: a:2:1, aid:1:1, all:1:1, and:2:1, come:1:1, country:1:1, country:2:1, dark:2:1, for:1:1, good:1:1, in:2:1, is:1:1, it:2:1, manor:2:1, men:1:1, midnight:2:1, night:2:1, now:1:1, of:1:1, past:2:1, stormy:2:1, the:1:2, the:2:2, their:1:1, time:1:1, time:2:1, to:1:2, was:2:2
How Inverted Files are Created
• Finally, the file can be split into:
  – A Dictionary or Lexicon file, and
  – A Postings file
How Inverted Files are Created
Dictionary/Lexicon (Term, N docs, Tot Freq): a:1:1, aid:1:1, all:1:1, and:1:1, come:1:1, country:2:2, dark:1:1, for:1:1, good:1:1, in:1:1, is:1:1, it:1:1, manor:1:1, men:1:1, midnight:1:1, night:1:1, now:1:1, of:1:1, past:1:1, stormy:1:1, the:2:4, their:1:1, time:2:2, to:1:2, was:1:2

Postings (Doc #, Freq), in dictionary order: (2,1) (1,1) (1,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,2) (2,2) (1,1) (1,1) (2,1) (1,2) (2,2)
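To make the process concrete, here is a minimal Python sketch that builds an inverted index over the two toy documents above. It skips stop-word removal and stemming, and a plain dict stands in for separate dictionary and postings files.

from collections import defaultdict

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

def build_inverted_index(docs):
    index = defaultdict(dict)           # term -> {doc_id: within-doc frequency}
    for doc_id, text in docs.items():
        for token in text.split():      # parse documents to extract tokens
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(sorted(index.items()))  # sort alphabetically, merging duplicate entries

index = build_inverted_index(docs)
print(index["country"])   # {1: 1, 2: 1}
print(index["the"])       # {1: 2, 2: 2}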
Inverted file
• Each page is represented as a vector of terms
• Stop words removed
• All words are stemmed
Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
  – document ID
  – frequency of term in doc (optional)
  – position of term in doc (optional)
  – font size (optional)
  – capitalization (optional)
  – descriptor type, e.g. title, anchor, etc. (optional)
• These lists can be used to solve Boolean queries:
  – country -> d1, d2
  – manor -> d2
  – country AND manor -> d2
• Also used for statistical ranking algorithms
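A minimal sketch of answering the Boolean AND query above by intersecting posting lists, reusing the index built in the earlier sketch.

def boolean_and(index, term1, term2):
    docs1 = set(index.get(term1, {}))   # posting list (doc IDs) for term1
    docs2 = set(index.get(term2, {}))   # posting list (doc IDs) for term2
    return sorted(docs1 & docs2)        # documents containing both terms

print(boolean_and(index, "country", "manor"))   # [2]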
Inverted Indexes for Web Search Engines
• Inverted indexes are still used, even though the web is so huge.
• Some systems partition the indexes across different machines. Each machine handles different parts of the data.
• Other systems duplicate the data across many machines; queries are distributed among the machines.
• Most do a combination of these.
Standard Web Search Engine Architecture
[Figure: same architecture as before: crawler machines crawl the web, duplicates are checked and documents stored, an inverted index is created, and search engine servers answer the user query and show results to the user.]
Query Serving Architecture
• Index divided into segments each served by a node
• Each row of nodes replicated for query load
• Query integrator distributes query and merges results
• Front end creates an HTML page with the query results
[Figure: a load balancer routes the query “travel” to a front end (FE1 … FE8); the front end passes it to a query integrator (QI1 … QI8); each query integrator fans the query out to index nodes Node1,1 … Node4,N, where each column serves one index segment and each row of nodes is a replica added for query load.]
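A minimal sketch of this scatter/gather flow: the index is split into segments, each served by a "node", and a query integrator sends the query to every node and merges the results. The in-memory segments and scores are illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor

# Each "node" serves one index segment: term -> [(doc_id, score), ...]
segments = [
    {"travel": [(11, 0.9), (17, 0.4)]},
    {"travel": [(23, 0.7)]},
    {"travel": [(31, 0.8), (35, 0.2)]},
]

def node_search(segment, query):
    # A node scores only the documents in its own segment
    return segment.get(query, [])

def query_integrator(query, top_k=3):
    # Distribute the query to all nodes in parallel and merge their results
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda seg: node_search(seg, query), segments))
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)[:top_k]

print(query_integrator("travel"))   # [(11, 0.9), (31, 0.8), (23, 0.7)]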
Overview
• History
• Search Engine Architecture
  – Crawler or Spider
  – Indexing
  – Ranking
• Web Spam
How to do ranking?
Ranking result pages
• Based on content
  – Number of occurrences of the search terms
  – Similarity to the query text
• Based on link structure
  – Backlink count
  – PageRank
  – Hub and authority scores (HITS)
Problems with Content-based Ranking?
Problems with content-based ranking
• Many pages containing the search terms may be of poor quality or irrelevant
  – Example: a page with just the line “search engine”.
• Many high-quality or relevant pages do not even contain the search terms
  – Example: Google homepage
• Pages containing more occurrences of the search terms are ranked higher; spamming is easy
  – Example: a page with the line “search engine” repeated many times
Backlink
• A backlink of a page p is a link that points to p
• A page with more backlinks is ranked higher
• Intuition: Each backlink is a “vote” for the page’s importance
• Based on local link structure; still easy to spam
  – Create lots of pages that point to a particular page
PageRank and HITS
• Page et al., “The PageRank Citation Ranking: Bringing Order to the Web.” 1998
• Kleinberg, “Authoritative Sources in a Hyperlinked Environment.” Journal of the ACM, 1999
• Main idea: Pages pointed to by high-ranking pages are ranked higher
• Definition is recursive by design
• Based on global link structure; hard to spam
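A minimal power-iteration sketch of PageRank on a tiny made-up link graph (not the graph or parameters of any real engine). The damping factor 0.85 follows the original paper; the example graph is an illustrative assumption.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                 # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:      # each outlink passes on a share of rank
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(links))   # C collects the most "votes" and ranks highest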
Slide adapted from Manning, Raghavan, & Schuetze
Manipulating Ranking
• Motives
  – Commercial, political, religious, lobbies
  – Promotion funded by advertising budget
• Operators
  – Contractors (Search Engine Optimizers) for lobbies, companies
  – Web masters
  – Hosting services
• Forum
  – Web master world ( www.webmasterworld.com )
Overview
• History
• Search Engine Architecture
  – Crawler or Spider
  – Indexing
  – Ranking
• Web Spam
Slide adapted from Manning, Raghavan, & Schuetze
A few spam technologies
• Cloaking
  – Serve fake content to search engine robot
  – DNS cloaking: switch IP address, impersonate
• Doorway pages
  – Pages optimized for a single keyword that re-direct to the real target page
• Keyword spam
  – Misleading meta-keywords, excessive repetition of a term, fake “anchor text”
  – Hidden text with colors, CSS tricks, etc.
• Link spamming
  – Mutual admiration societies, hidden links, awards
  – Domain flooding: numerous domains that point or re-direct to a target page
• Robots
  – Fake click stream
  – Fake query stream
  – Millions of submissions via Add-Url
Cloaking
[Figure: the server asks “Is this a search engine spider?” If yes (Y), it serves the spam page; if no (N), it serves the real doc.]
Meta-Keywords = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
Cloaking
• Content presented to the search engine spider is different from that presented to the user’s browser.
• This is done by delivering content based on the IP addresses or the User-Agent HTTP header of the user requesting the page.
• When a user is identified as a search engine spider, a server-side script delivers a different version of the web page, one that contains content not present on the visible page.
• The purpose of cloaking is to deceive search engines so they display the page when it would not otherwise be displayed
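A minimal sketch of User-Agent-based cloaking as described above, shown only to illustrate how the technique works. The bot signatures and page contents are illustrative assumptions.

BOT_SIGNATURES = ("googlebot", "bingbot", "slurp")   # hypothetical spider names

def serve_page(user_agent):
    # Deliver different content depending on who appears to be asking
    if any(bot in user_agent.lower() for bot in BOT_SIGNATURES):
        return "<html>keyword-stuffed page shown only to crawlers</html>"
    return "<html>the page that real visitors actually see</html>"

print(serve_page("Googlebot/2.1 (+http://www.google.com/bot.html)"))
print(serve_page("Mozilla/5.0 (Windows NT 10.0) Firefox/115.0"))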
Cloaking
• Cloaking is often used as a spamdexing technique, to try to trick search engines into giving the relevant site a higher ranking;
• It can also be used to trick search engine users into visiting a site based on the search engine’s description of it, when the site turns out to have substantially different, or even pornographic, content after redirection!
• Search engines delist sites when deceptive cloaking is reported.
• Cloaking is a form of the doorway page technique.
Redirection
• Simple approach: take advantage of the refresh meta tag in the header of an HTML document. By setting the refresh time to zero and naming a target page, spammers achieve redirection as soon as the page gets loaded into the browser:
  <meta http-equiv="refresh" content="0;url=target.html">
• Search engines can easily detect it!
Redirection
• Using scripts, which are not executed by web crawlers:
  <script language="javascript">
  <!--
  location.replace("target.html")
  -->
  </script>
Doorway Pages
• Creating low-quality web pages that contain very little content but are instead stuffed with very similar keywords and phrases.
• They are designed to rank highly within the search results, but serve no purpose to visitors looking for information. A doorway page will generally have “click here to enter” on the page.
• Once they are reported, search engines delist the sites!
Keyword Spam
• Keyword spam is the excessive repetition of keywords on a page
• It is usually done using hidden elements that are indexed by search engines but are not visible to users, including Title, Meta, and Alt.
  – Black-hatters have found that they can disguise keywords in the contents of the page by making the text the same color as the background and tucking it away at the bottom of the page
  – CTRL-A to highlight all the text on a page, and they get caught!
  – MSN Search claims to automatically penalize these pages.
• E.g., <meta name="keywords" content="wikipedia,encyclopedia"/> specifies that the document is relevant to wikipedia, encyclopedia!
Keyword Spam
• An extension of the hidden-text idea is to hide the keyword spam using style sheets (CSS).
  – This gives the spammer great scope for stuffing keywords into important elements such as headings without them being noticed. The following style will format all Heading 1 text as 1pt-high white text:
    H1 { font-size : 1pt; color : white; }
• There are many other ways of hiding content from users, such as Layers and IFrames, while still having it visible to search engines.
• Search engines can detect them, at the cost of a slowdown, by parsing style sheets and other structures!
Link Spam
• Takes advantage of link-based ranking algorithms, such as Google’s PageRank algorithm and the HITS algorithm, which give a higher ranking to a website the more other highly ranked websites link to it
• Link farms: involve creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Page hijacking: achieved by creating a rogue copy of a popular website which shows contents similar to the original to a web crawler, but redirects web surfers to unrelated or malicious websites.