Introduction to Search Engines
Search Engine Overview
User
► What am I looking for? - Identification of information need
► What question do I ask? - Query formulation
Intermediary
► What is the searcher looking for? - Discovery of user’s information need
► How should the question be posed? - Query representation
► Where is the relevant information? - Query-document matching
Information
► What data to collect? - Collection development
► What information to index? - Indexing/Representation
► How to represent it? - Data structure
Search Engines 2
Query → Searchable Index → Search Results
Data (0)
1. Document Collection - e.g., spider/crawler
2. Document Indexing - term indexing (tokenizing, stop & stem), term weighting
Search
(1) Query Indexing
(2) Document Ranking
(3) Result Display
Search Engine: Data
Document Collection
► Select target data sources - e.g., domain, corpus, WWW
► Harvest data - e.g., data entry, data import, spider/crawler
Document Indexing
► Select indexing sources (index terms) - e.g., metadata, keywords, content
► Extract indexing terms - e.g., tokenization, stop & stem
► Assign term weights - e.g., tf-idf, Okapi
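The three indexing steps above (extract terms, stop & stem, assign weights) can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the stop list, the crude plural-stripping stemmer, the toy corpus, and the specific weighting formula tf × log(N/df) are all assumptions made for the example. Real systems typically use a Porter-style stemmer and one of several tf-idf or Okapi variants.

```python
import math
import re
from collections import Counter

# Hypothetical stop list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}

def naive_stem(token: str) -> str:
    """Crude stemmer: strips a trailing plural 's' (not Porter stemming)."""
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def index_terms(text: str) -> list[str]:
    """Tokenize, remove stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def tf_idf(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Weight each term as tf * log(N / df), one common tf-idf variant."""
    term_counts = {doc_id: Counter(index_terms(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    return {
        doc_id: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }

# Toy corpus (assumed, not from the slides).
corpus = {
    "A": "The crawler harvests web pages",
    "B": "The index stores terms and pages",
    "C": "Stemming maps terms to stems",
}
weights = tf_idf(corpus)
for doc_id, terms in weights.items():
    print(doc_id, {t: round(w, 3) for t, w in terms.items()})
```

Under this formula, a term that occurs in every document gets weight 0, which is the intended effect of the idf factor: terms that do not discriminate between documents contribute nothing to matching.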
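“The frequency of word occurrence in an article furnishes a useful measurement of word significance.”
- The words that appear in a document can be used to analyze its content, and a word's frequency of occurrence serves as a measure of its importance as a subject term.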
Luhn, H.P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.
Search Engine: Indexing Process
Documents (Text)
1. Tokenization → Tokens
2. Token Selection → Selected Tokens
3. Token Normalization → Sequential Index
4. Term Weighting → Inverted Index
Documents
D1: Information retrieval seminars
D2: Retrieval Models and Information Retrieval
D3: Information Model

Tokens
D1: information, retrieval, seminar(s)
D2: retrieval, model(s), and, information, retrieval
D3: information, model

Sequential index
D1: information 1, retrieval 1, seminar 1
D2: information 1, model 1, retrieval 2
D3: information 1, model 1

Inverted index
                     D1  D2  D3
wd1 (information)     1   1   1
wd2 (model)           0   1   1
wd3 (retrieval)       1   2   0
wd4 (seminar)         1   0   0
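The D1-D3 example can be reproduced with a short Python sketch that builds the sequential index (document → term counts) and then inverts it (term → documents). The stop list containing only "and" and the plural-stripping `normalize` helper are assumptions that happen to suffice for these three documents; they stand in for a real stop list and stemmer.

```python
from collections import Counter, defaultdict

# The three example documents from the slide.
DOCS = {
    "D1": "Information retrieval seminars",
    "D2": "Retrieval Models and Information Retrieval",
    "D3": "Information Model",
}

STOP_WORDS = {"and"}  # assumed minimal stop list; only "and" is needed here

def normalize(token: str) -> str:
    """Lowercase and strip a trailing plural 's' (a crude stand-in for stemming)."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def sequential_index(docs):
    """Map each document to its term counts: docID -> {term: tf}."""
    index = {}
    for doc_id, text in docs.items():
        tokens = text.split()                                           # tokenization
        selected = [t for t in tokens if t.lower() not in STOP_WORDS]   # token selection
        index[doc_id] = Counter(normalize(t) for t in selected)         # normalization + counting
    return index

def invert(seq_index):
    """Turn docID -> {term: tf} into term -> {docID: tf}."""
    inverted = defaultdict(dict)
    for doc_id, counts in seq_index.items():
        for term, tf in counts.items():
            inverted[term][doc_id] = tf
    return dict(inverted)

seq = sequential_index(DOCS)
inv = invert(seq)
for term in sorted(inv):
    print(term, inv[term])
```

The inverted index stores only non-zero entries, so the zeros in the matrix correspond to documents that are simply absent from a term's entry.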
Search Engine: Search
Query Indexing
► Tokenization
► Stop & Stem
► Term Weighting
Document Ranking
► Query-Document matching
► Document Score computation
Result Display
► Content - e.g., title & snippets
► Layout - e.g., grouped by category
► Toppings - e.g., related searches
Query: What is information retrieval?
Q: information 1, retrieval 1

Index Term           D1  D2  D3
wd1 (information)     1   1   1
wd2 (model)           0   1   1
wd3 (retrieval)       1   2   0
wd4 (seminar)         1   0   0

Rank  docID  score
1     D2     3
2     D1     2
3     D3     1
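The scores in the ranking table are consistent with a simple inner product between the query term weights and each document's term frequencies (for D2: 1×1 + 1×2 = 3). The slide does not name the scoring function, so treating it as a dot product is an assumption; the stop list below is likewise assumed just to reduce the query to "information" and "retrieval".

```python
# Term-document matrix from the slide, stored as term -> {docID: term frequency}.
INDEX = {
    "information": {"D1": 1, "D2": 1, "D3": 1},
    "model": {"D2": 1, "D3": 1},
    "retrieval": {"D1": 1, "D2": 2},
    "seminar": {"D1": 1},
}

STOP_WORDS = {"what", "is"}  # assumed stop list for this query

def index_query(query: str) -> dict[str, int]:
    """Tokenize the query, drop stop words, and count the remaining terms."""
    weights = {}
    for raw in query.lower().split():
        token = raw.strip("?.,!")
        if token and token not in STOP_WORDS:
            weights[token] = weights.get(token, 0) + 1
    return weights

def rank(query: str, index) -> list[tuple[str, int]]:
    """Score each document by the dot product of query and document weights."""
    scores = {}
    for term, q_weight in index_query(query).items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + q_weight * d_weight
    # Highest score first; ties broken by docID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

results = rank("What is information retrieval?", INDEX)
for position, (doc_id, score) in enumerate(results, start=1):
    print(position, doc_id, score)
```

Because the index is inverted, only the postings for the query's own terms are visited; documents that share no term with the query never receive a score.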
Query: 정보검색 (Information Retrieval)
Query: 검색엔진 (Search Engine)
Result Categories
1. Encyclopedia
2. Naver Books
3. Q&A DB (지식iN)
4. Magazine
5. Café
6. Blog
7. Book
8. Map
9. Website
10. Advertisement (파워링크)
11. Image
12. Webpage
13. Naver News Library
14. Video
15. Naver AppStore
16. Naver Scholar
17. Naver Post
18. Naver Shopping
19. News
20. Naver Dictionary
► Proprietary (Naver-specific) content
► Dynamic category order
► Toppings
• Search by Category
• Related Searches
• Popular Searches (by category)
Query: Information Retrieval
Query: Search Engine
Result Categories
1. Webpage
2. Advertisement
► Webpage-centric content
► Dynamic category order
► Toppings
• Search by Category
• Related Searches
Search Engine vs. Database vs. Directories
                   Search Engine               Database                    Directories
Corpus Type        General                     Specific                    General/Specific
Data Collection    Automatic - crawler/spider  Manual - data entry/import  Manual - classification
Data Quality       Not controlled              Controlled                  Controlled
Data Organization  None (bag-of-words)         Structured - Relational     Structured - Hierarchical
Query Input        Text box                    Field-specific - Boolean    Text box, Category Tree
Search Result      Ranked - documents          Not ranked - records        Ranked - categories
Search Index       Document text               Database Tables             Category Tree
e.g.               Google                      Library Search              dmoz.org
WIDIT 2003: Web IR System
[Figure: WIDIT 2003 system architecture.
Labeled components: Documents; Indexing Module with sub-indexes (Body Index, Anchor Index, Header Index); Topics turned into Simple Queries and Phrase Queries; Retrieval Module; Fusion Module; Reranking Module; System Training; Dynamic Tuning.
Labeled outputs: Search Results, Fusion Result, Final Result.]
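The diagram names a Fusion Module that combines results retrieved from several sub-indexes, but the slide does not say how the combination is computed. As a generic illustration only (not WIDIT's documented method), one widely used approach is a weighted sum of min-max normalized scores. The run names, scores, and weights below are hypothetical.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one run's scores to [0, 1] so that runs are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores equal: treat every document as a full match for this run.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def weighted_sum_fusion(runs, weights):
    """Combine several result lists into one ranking by weighted score sum."""
    fused = {}
    for name, scores in runs.items():
        for doc_id, s in min_max_normalize(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[name] * s
    # Highest fused score first; ties broken by docID for a stable order.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical per-sub-index retrieval scores and fusion weights.
runs = {
    "body": {"d1": 12.0, "d2": 8.0, "d3": 4.0},
    "anchor": {"d2": 3.0, "d4": 1.0},
    "header": {"d1": 2.0, "d2": 1.0},
}
weights = {"body": 0.6, "anchor": 0.3, "header": 0.1}

fused = weighted_sum_fusion(runs, weights)
for doc_id, score in fused:
    print(doc_id, round(score, 3))
```

In a setup like this, the fusion weights are the natural parameters to fit during a training or tuning stage, since they control how much each sub-index contributes to the final ranking.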
WIDIT 2004: Web IR w/ Query Classification
[Figure: WIDIT 2004 system architecture.
Labeled components: Documents; Indexing Module with sub-indexes (Body Index, Anchor Index, Header Index); Topics turned into Simple Queries and Expanded Queries; Query Classification Module producing Query Types; Retrieval Module; Fusion Module; Re-ranking Module; Static Tuning; Dynamic Tuning.
Labeled outputs: Search Results, Fusion Result, Final Result.]
WIDIT 2004: Dynamic Tuning
WIDIT 2005: Web HARD IR System
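[Figure: WIDIT 2005 HARD system architecture.
Labeled inputs: Topics, Documents, WordNet, Web, CF, User.
Labeled components: NLP Module; OSW Module; WebX Module; Indexing Module with Inverted Index; Retrieval Module; Fusion Module; Re-ranking Module; Automatic Tuning.
Labeled term sources: Synonym, Definition, Noun Phrase, Web Terms, OSW Phrase, CF Terms.
Labeled outputs: Search Results, Baseline Result, Post-CF Result, Final Result.]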
WIDIT 2006: Blog IR System