How Search Engines Work
Presentation by
Chinna
What is Search Engine
A search engine is a software program that searches for sites based on the words you designate as search terms.
"Search engine" is the popular term for an Information Retrieval (IR) system.
Motto of search engines
A web search engine is designed to search for information on the World Wide Web and FTP servers. The search results are generally presented in a list often referred to as SERPs, or "search engine results pages". The results may consist of web pages, images, and other types of files.
Purpose of Search Engines
Helping people find what they’re looking for
• Starts with an "information need"
• Convert to a query
• Gets results
In the materials available
• Web pages
• Other formats
• Deep Web
HISTORY
Archie – First search tool for the Internet
Gopher – indexed plain text documents
Jughead – searched the files stored in Gopher index systems
Wandex – First Web search engine
How web search engines work
A search engine operates in the following order:
Web Crawling
Indexing
Searching
How Do Search Engines Work
Spiders
Robots
Search is Not a Panacea
Search can’t find what’s not there
• The content is hugely important
Information Architecture is vital
Usable sites have good navigation and structure
Search Engine Modules
A query processor
A search and matching function
A ranking capability
Summarizing and presenting documents
Search Engines Mode of Working in Earlier Days
From 1990-1998 (1st generation of search tools):
• Looked at title of web pages
• Ranking was based on page content
• Looked at number of times the search term appeared on the page
• Looked at metatags
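
As a rough illustration of first-generation ranking, the sketch below scores a page by counting how often the search term appears in its title, body, and meta keywords. The function name `first_gen_score` and the weights are assumptions made for this example, not values taken from any real engine.

```python
import re

def first_gen_score(term, title, body, meta_keywords="",
                    title_weight=3, meta_weight=2):
    """Score a page by counting whole-word occurrences of one term.

    The weights are hypothetical; they only show that a title or
    metatag hit could count for more than a hit in the body text.
    """
    pattern = r"\b" + re.escape(term.lower()) + r"\b"

    def count(text):
        return len(re.findall(pattern, text.lower()))

    return (title_weight * count(title)
            + count(body)
            + meta_weight * count(meta_keywords))

# Each page is (title, body, meta keywords); made-up sample data.
pages = {
    "a.html": ("Cricket news", "Cricket scores and cricket fixtures.", "cricket, sport"),
    "b.html": ("Weather", "No cricket today because of rain.", "weather"),
}
ranked = sorted(pages, key=lambda p: first_gen_score("cricket", *pages[p]),
                reverse=True)
print(ranked)  # ['a.html', 'b.html']
```

Counting alone is easy to game by repeating keywords, which is one reason later engines added link analysis.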
SEO (Search Engine Optimization)
Used by companies to rank higher in search engine results
White hat: using legitimate techniques
Black hat: using deceptive techniques to trick the search engine, such as paying sites to link to you
Search Processing
Search is Only as Good as the Content
Users blame the search engine
• Even when the content is unavailable
Understand the scope of site or intranet
• Kinds of information
• Divided sites: products / corporate info
• Dates
• Languages
• Sources and data silos: databases...
• Update processes
Making a Searchable Index
Store text to search it later
Many ways to gather text
• Crawl (spider) via HTTP
• Read files on file servers
• Access databases (HTTP or API)
• Data silos via local APIs
• Applications, CMSs, via Web Services
Security and Access Control
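
To make the "crawl (spider) via HTTP" option concrete, here is a minimal breadth-first crawler sketch. It assumes a caller-supplied `fetch(url)` function that returns HTML text or `None`; the in-memory `site` dictionary and the `example.test` URLs stand in for real HTTP requests and are purely illustrative. A real spider would also honor robots.txt, rate limits, and access control.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkAndTextParser(HTMLParser):
    """Collect href targets and text chunks from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; returns {url: extracted text} for later indexing."""
    seen, queue, store = {start_url}, deque([start_url]), {}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page missing or not accessible
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()
        store[url] = " ".join(parser.text)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store

# Stand-in for HTTP: a tiny in-memory "site" (hypothetical URLs).
site = {
    "http://example.test/": '<a href="/about">About</a> Home page',
    "http://example.test/about": '<a href="/">Home</a> About us',
}
stored = crawl("http://example.test/", site.get)
print(stored)
```

Passing `fetch` in as a parameter keeps the crawl logic separate from the transport, so the same loop could read from file servers or an API instead.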
Robot Indexing Diagram
Source: James Ghaphery, VCU
What the Index Needs
Basic information for document or record
• File name / URL / record ID
• Title or equivalent
• Size, date, MIME type
Full text of item
More metadata
• Product name, picture ID
• Category, topic, or subject
• Other attributes, for relevance ranking and display
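
One possible way to hold the fields listed above is a small record type. The class name `IndexRecord` and its field names are assumptions for illustration; an actual search tool would define its own schema.

```python
from dataclasses import dataclass, field

@dataclass
class IndexRecord:
    """One indexed document or database record (illustrative field names)."""
    record_id: str        # file name, URL, or record ID
    title: str            # title or equivalent
    size_bytes: int
    modified: str         # date, stored here as ISO 8601 text
    mime_type: str
    full_text: str = ""   # full text of the item
    metadata: dict = field(default_factory=dict)  # product name, category, ...

# Made-up sample record.
rec = IndexRecord("http://example.test/p/42", "Blue Widget", 2048,
                  "2009-05-01", "text/html", "A sturdy blue widget.",
                  {"category": "widgets"})
print(rec.title, rec.metadata["category"])  # Blue Widget widgets
```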
Simple Index Diagram
Index Issues
Stopwords
Stemming
Metadata
• Explicit (tags)
• Implicit (context)
Semantics
• CMS and Database fields
• XML tags and attributes
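
The effect of stopwords and stemming on an index can be sketched with a small inverted index. The stopword list and the suffix-stripping `stem` function below are deliberately crude assumptions; production systems typically use a proper stemmer (for example the Porter algorithm) and a language-specific stopword list.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}  # tiny sample list

def stem(word):
    """Very crude suffix stripping, for illustration only."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split into words, drop stopwords, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]

def build_index(docs):
    """Map each stemmed term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "The spider crawls the web", 2: "Indexing web pages", 3: "A web of spiders"}
index = build_index(docs)
print(index["spider"], index["web"])  # [1, 3] [1, 2, 3]
```

Because "spiders" and "spider" reduce to the same term, a query for either finds both documents. The query side must apply the same tokenizer for this to work.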
Search Query Processing
What happens after you click the search button, and before retrieval starts.
Usually in this order
• Handle character set, maybe language
• Look for operators and organize the query
• Look for field names or metadata
• Extract words (just like the indexer)
• Deal with letter casing
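
The steps above can be sketched as a small query parser. The operator set (`AND`, `OR`, `NOT`), the `field:value` syntax, and the function name `parse_query` are assumptions for this example; real engines differ in their query syntax.

```python
import re
import unicodedata

def parse_query(raw):
    """Split a raw query into phrases, field filters, operators, and terms."""
    text = unicodedata.normalize("NFKC", raw)           # character-set handling
    phrases = re.findall(r'"([^"]+)"', text)            # quoted phrases
    text = re.sub(r'"[^"]+"', " ", text)
    fields, operators, terms = {}, [], []
    for token in text.split():
        if token in ("AND", "OR", "NOT"):               # operators, checked before lowercasing
            operators.append(token)
        elif re.fullmatch(r"\w+:\S+", token):           # field name or metadata filter
            name, value = token.split(":", 1)
            fields[name.lower()] = value.lower()
        else:                                           # word extraction and letter casing
            terms.extend(re.findall(r"\w+", token.lower()))
    return {"phrases": [p.lower() for p in phrases], "fields": fields,
            "operators": operators, "terms": terms}

parsed = parse_query('title:Engines "Web Crawling" Spiders AND Robots')
print(parsed)
```

For this input the parser reports one phrase (`web crawling`), one field filter (`title` = `engines`), one operator (`AND`), and the terms `spiders` and `robots`.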
Search and Retrieval
Retrieval: find files with query terms
Not the same as relevance ranking
Recall: find all relevant items
Precision: find only relevant items
Increasing one usually decreases the other
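
A short worked example shows the trade-off. Precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that were retrieved. The document IDs below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}

narrow = precision_recall(["d1", "d2"], relevant)
broad = precision_recall(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant)
print(narrow)  # (1.0, 0.5): everything returned is relevant, but half is missed
print(broad)   # (0.5, 1.0): nothing is missed, but half the results are noise
```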
Retrieval = Matching
Single-word queries
• Find items containing that word
Multi-word queries: combine lists
• Any: every item with any query word
• All: only items with every word
• Phrases: find only items with all words in order
Boolean and complex queries
• Use algorithm to combine lists
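
The Any, All, and Phrase modes can be sketched as set operations over a positional index (term, then document, then word positions). The function names and the sample documents are assumptions for this example, and tokenization is reduced to lowercase whitespace splitting.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]} for lowercase whitespace-split words."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def match_any(index, words):
    """Union of the lists: every item with any query word."""
    return set().union(*(set(index.get(w, {})) for w in words))

def match_all(index, words):
    """Intersection of the lists: only items with every word."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()

def match_phrase(index, words):
    """Items where the words appear consecutively and in order."""
    hits = set()
    for doc_id in match_all(index, words):
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

docs = {1: "web search engine", 2: "search the web",
        3: "engine repair", 4: "engine for search"}
idx = build_positional_index(docs)
query = ["search", "engine"]
print(sorted(match_any(idx, query)))     # [1, 2, 3, 4]
print(sorted(match_all(idx, query)))     # [1, 4]
print(sorted(match_phrase(idx, query)))  # [1]
```

Each mode is stricter than the one before it, which is why restrictive defaults can leave users with no results.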
Why Searches Fail
Empty search
Nothing on the site on that topic (scope)
Misspelling or typing mistakes
Vocabulary differences
Restrictive search defaults
Restrictive search choices
Software failure
Relevance Ranking
Theory: sort the matching items, so the most relevant ones appear first
Can't really know what the user wants
Relevance is hard to define and situational
Short queries tend to be deeply ambiguous
• What do people mean when they type “bank”?
First 10 results are the most important
Relevance Processing
Sorting documents on various criteria
Start with words matching query terms
Citation and link analysis
• Like old library Citation Indexes
• Not only hypertext, but the links
• Google PageRank
• Incoming links
• Authority of linkers
Taxonomies and external metadata
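
The idea that incoming links and the authority of the linking pages both matter can be sketched with a simplified PageRank computed by power iteration. This is a textbook-style approximation, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page link graph are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page with no outlinks: spread its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: a, b, and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
for page in sorted(rank):
    print(page, round(rank[page], 3))
```

Page `c` ends up highest because three pages link to it. Page `a` has only one incoming link, but it comes from the high-ranking `c`, so `a` still outranks `b` and `d`.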
Search Results Interface
What users see after they click the Search button
The most visible part of search
Elements of the results page
• Page layout and navigation
• Results header
• List of results items
• Results footer
Search Suggestions
Human judgment beats algorithms
Great for frequent, ambiguous searches
• Use search log to identify best candidates
Recommend good starting pages
• Product information, FAQs, etc.
Requires human resources
• That means money and time
More static than algorithmic search
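
Finding candidates in the search log can be as simple as counting normalized queries. In this sketch the log entries, the `SUGGESTIONS` table, and its paths are all hypothetical; the point is that people pick the target pages and the log only tells them where to look.

```python
from collections import Counter

def top_queries(log, n=3):
    """Return the n most frequent queries after lowercasing and trimming."""
    normalized = (" ".join(q.lower().split()) for q in log)
    return Counter(normalized).most_common(n)

# Hand-maintained suggestions keyed by normalized query (made-up paths).
SUGGESTIONS = {"returns": "/help/returns-policy", "contact": "/about/contact-us"}

log = ["Returns", "contact", "returns ", "price list", "RETURNS", "contact"]
top = top_queries(log)
for query, count in top:
    print(query, count, SUGGESTIONS.get(query, "(no suggestion yet)"))
```

Here "price list" shows up as a frequent query without a suggestion, which flags it for an editor to review.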
Search Metrics
Number of searches
Number of matched searches
Traffic from search to high-value pages
Relate search changes to other metrics
Query Example
Consider the query "Mahendra Singh Dhoni".
A good answer contains all three words, and the more frequently the better. We call this Term Frequency (TF).
Some query terms are more important than others because they have better discriminating power.
For example, an answer containing only "Dhoni" is likely to be better than an answer containing only "Mahendra". We call this Inverse Document Frequency (IDF).
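
A small TF-IDF sketch makes the example concrete. It scores each document as the sum, over the query terms, of term frequency times a smoothed inverse document frequency, `log(N / df) + 1`, which is one common variant among several. The four sample documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of tf * idf."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(docs)
    scores = {}
    for d, words in tokenized.items():
        counts = Counter(words)                    # term frequency per document
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for w in tokenized.values() if term in w)
            if df:
                idf = math.log(n / df) + 1.0       # rarer terms weigh more
                score += counts[term] * idf
        scores[d] = score
    return scores

docs = {
    "d1": "mahendra singh dhoni leads india",
    "d2": "dhoni scores again",
    "d3": "mahendra kumar wins award",
    "d4": "singh and mahendra attend the ceremony",
}
scores = tf_idf_scores("Mahendra Singh Dhoni", docs)
print(sorted(scores, key=scores.get, reverse=True))  # ['d1', 'd4', 'd2', 'd3']
```

In this sample "Mahendra" appears in three documents and "Dhoni" in two, so the document containing only "Dhoni" (d2) outranks the one containing only "Mahendra" (d3).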
Search Will Never Be Perfect
Search engines can’t read minds
• User queries are short and ambiguous
Some things will help
• Design a usable interface
• Show match words in context
• Keep index current and complete
• Adjust heuristic weighting
• Maintain suggestions and synonyms
• Consider faceted metadata search