Transcript
Page 1: Introduction to Search Engines · widit2.knu.ac.kr/~kiyang/teaching/gSE/f18/lectures/2.gSE... · 2018-09-18

Introduction to Search Engines

Page 2:

Search Engine Overview

User
What am I looking for? - Identification of info. need
What question do I ask? - Query formulation

Intermediary
What is the searcher looking for? - Discovery of user’s info. need
How should the question be posed? - Query representation
Where is the relevant information? - Query-document matching

Information
What data to collect? - Collection development
What information to index? - Indexing/Representation
How to represent it? - Data structure

Search Engines 2

[Diagram: Query, Searchable Index, and Search Results, with steps labeled 0 to 3]

Search
(1) Query Indexing
(2) Document Ranking
(3) Result Display

Data (0)
1. Document Collection - e.g., spider/crawler
2. Document Indexing
   - term indexing (tokenizing, stop & stem)
   - term weighting

Page 3:

Search Engine: Data

Document Collection
► Select target data sources – e.g., domain, corpus, WWW
► Harvest data – e.g., data entry, data import, spider/crawler

Document Indexing
► Select indexing sources (index terms) – e.g., metadata, keywords, content
► Extract indexing terms – e.g., tokenization, stop & stem
► Assign term weights – e.g., tf-idf, Okapi


“The frequency of word occurrence in an article furnishes a useful measurement of word significance.”
- The words that occur in a document can be used to analyze its content, and how often a word occurs serves as a measure of its significance as a subject term.

Luhn, H.P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.
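
As a rough illustration of the term-weighting step, the sketch below computes tf-idf weights using one common variant, tf × log(N/df). The slide names tf-idf and Okapi but gives no formula, so the formula, the function name `tf_idf`, and the three small token lists are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    This is just one common tf-idf variant, chosen here as an assumption.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical, already-tokenized documents.
docs = [
    ["information", "retrieval", "seminar"],
    ["retrieval", "model", "information", "retrieval"],
    ["information", "model"],
]
weights = tf_idf(docs)
```

Under this variant, a term that occurs in every document (here "information") gets weight 0, while a term confined to one document (here "seminar") gets the highest idf, which is in the spirit of Luhn's point that occurrence statistics signal significance.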

Page 4:

Search Engine: Indexing Process

[Diagram labels: Documents (Text), Tokenization, Tokens, Token Selection, Select Tokens, Token Normalization, Sequential Index, Term Weighting, Inverted Index]

Documents (Text)
D1: Information retrieval seminars
D2: Retrieval Models and Information Retrieval
D3: Information Model

Tokens
D1: information, retrieval, seminar(s)
D2: retrieval, model(s), and, information, retrieval
D3: information, model

Sequential Index
D1: information 1, retrieval 1, seminar 1
D2: information 1, model 1, retrieval 2
D3: information 1, model 1

Inverted Index
                    D1  D2  D3
wd1 (information)    1   1   1
wd2 (model)          0   1   1
wd3 (retrieval)      1   2   0
wd4 (seminar)        1   0   0
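
The indexing example on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the stop list contains only "and", the stemmer is a toy plural-stripper standing in for a real stemming algorithm, and all function names (`tokenize`, `stem`, `index_terms`, `build_inverted_index`) are hypothetical.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"and"}  # assumed one-word stop list, enough for this example


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Toy stemmer (assumption): strip a trailing 's' from longer tokens."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def index_terms(text):
    """Tokenize, drop stop words, then normalize the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}."""
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(index_terms(text)).items():
            inverted[term][doc_id] = tf
    return dict(inverted)


docs = {
    "D1": "Information retrieval seminars",
    "D2": "Retrieval Models and Information Retrieval",
    "D3": "Information Model",
}
inverted = build_inverted_index(docs)
```

With these three documents, `inverted["retrieval"]` maps D1 to 1 and D2 to 2, matching the wd3 row of the inverted index. Documents with a count of 0 are simply absent from a term's entry.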

Page 5:

Search Engine: Search

Query Indexing
► Tokenization
► Stop & Stem
► Term Weighting

Document Ranking
► Query-Document matching
► Document Score computation

Result Display
► Content - e.g., title & snippets
► Layout - e.g., grouped by category
► Toppings - e.g., related searches

Index Term          D1  D2  D3
wd1 (information)    1   1   1
wd2 (model)          0   1   1
wd3 (retrieval)      1   2   0
wd4 (seminar)        1   0   0

Query: What is information retrieval?
Q: information 1, retrieval 1

Rank  docID  score
1     D2     3
2     D1     2
3     D3     1
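
The ranking on this slide is consistent with scoring each document by the inner product of query and document term frequencies. The sketch below assumes that scoring rule, a two-word stop list ("what", "is"), and hypothetical names (`INDEX`, `query_terms`, `rank`); the slide itself does not state the scoring formula.

```python
from collections import Counter

# Inverted index from the slide: term -> {docID: term frequency}.
INDEX = {
    "information": {"D1": 1, "D2": 1, "D3": 1},
    "model": {"D2": 1, "D3": 1},
    "retrieval": {"D1": 1, "D2": 2},
    "seminar": {"D1": 1},
}
STOPWORDS = {"what", "is"}  # assumed stop list for the example query


def query_terms(query):
    """Tokenize the query and count the terms that survive stop-word removal."""
    tokens = query.lower().replace("?", " ").split()
    return Counter(t for t in tokens if t not in STOPWORDS)


def rank(query):
    """Score documents by summing query tf * document tf over matching terms."""
    scores = Counter()
    for term, q_tf in query_terms(query).items():
        for doc_id, d_tf in INDEX.get(term, {}).items():
            scores[doc_id] += q_tf * d_tf
    # Highest score first; ties broken by docID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank("What is information retrieval?")
```

Here D2 scores 1 + 2 = 3, D1 scores 1 + 1 = 2, and D3 scores 1 + 0 = 1, which reproduces the rank table.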

Page 6:

[Screenshot: search results page annotated with category callouts 1-14]

Page 7:

[Screenshot: search results page annotated with category callouts 15-20]

Result Categories
1. Encyclopedia
2. Naver Books
3. Q&A DB (지식iN, Knowledge iN)
4. Magazine
5. Café
6. Blog
7. Book
8. Map
9. Website
10. Advertisement (파워링크, Power Link)
11. Image
12. Webpage
13. Naver News Library
14. Video
15. Naver AppStore
16. Naver Scholar
17. Naver Post
18. Naver Shopping
19. News
20. Naver Dictionary

Proprietary (Naver-specific) content
Dynamic category order
Toppings
• Search by Category
• Related Searches
• Popular Searches (by category)

Query: 정보검색 (Information Retrieval)

Query: 검색엔진 (Search Engine)

Page 8:

[Screenshot: search results page annotated with category callouts 1-2]

Result Categories
1. Webpage
2. Advertisement

Webpage-centric content
Dynamic category order
Toppings
• Search by Category
• Related Searches

Query: Information Retrieval

Query: Search Engine

Page 9:

Search Engine vs. Database vs. Directories

                   Search Engine               Database                     Directories
Corpus Type        General                     Specific                     General/Specific
Data Collection    Automatic - crawler/spider  Manual - data entry/import   Manual - classification
Data Quality       Not controlled              Controlled                   Controlled
Data Organization  None (bag-of-words)         Structured - Relational      Structured - Hierarchical
Query Input        Text box                    Field-specific - Boolean     Text box, Category Tree
Search Result      Ranked - documents          Not ranked - records         Ranked - categories
Search Index       Document text               Database Tables              Category Tree
e.g.               Google                      Library Search               dmoz.org

Page 10:

WIDIT 2003: Web IR System

[System architecture diagram; labeled components:]
• Documents
• Indexing Module
• Sub-indexes: Body Index, Anchor Index, Header Index
• Topics
• Queries: Simple Queries, Phrase Queries
• Retrieval Module
• Search Results
• Fusion Module
• Fusion Result
• Reranking Module
• Final Result
• System Training
• Dynamic Tuning
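
The diagram shows a Fusion Module that combines search results drawn from several sub-indexes, but the slide does not say how. Purely as an illustration of the general idea, the sketch below uses a weighted sum of max-normalized scores, which is one common fusion technique and is not confirmed to be what WIDIT used. The function name `fuse`, the weights, the document IDs, and the scores are all hypothetical.

```python
def fuse(result_lists, weights):
    """Combine per-index results with a weighted sum of normalized scores.

    result_lists: {index_name: {doc_id: raw_score}}
    weights:      {index_name: weight} (hypothetical values)
    """
    fused = {}
    for name, results in result_lists.items():
        if not results:
            continue
        top = max(results.values())
        for doc_id, score in results.items():
            # Max-normalize so scores from different indexes are comparable.
            norm = score / top if top else 0.0
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[name] * norm
    # Highest fused score first; ties broken by document ID.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical results from the three sub-indexes named on the slide.
results = {
    "body": {"d1": 4.0, "d2": 2.0},
    "anchor": {"d2": 5.0, "d3": 1.0},
    "header": {"d2": 1.0},
}
fused = fuse(results, {"body": 0.5, "anchor": 0.3, "header": 0.2})
```

In this made-up example, d2 is found by all three sub-indexes and ends up ranked first, ahead of d1, which only the body index retrieved.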

Page 11:

WIDIT 2004: Web IR w/ Query Classification

[System architecture diagram; labeled components:]
• Documents
• Indexing Module
• Sub-indexes: Body Index, Anchor Index, Header Index
• Topics
• Queries: Simple Queries, Expanded Queries
• Query Classification Module
• Query Types
• Retrieval Module
• Search Results
• Fusion Module
• Fusion Result
• Re-ranking Module
• Final Result
• Static Tuning
• Dynamic Tuning

Page 12:

WIDIT 2004: Dynamic Tuning


Page 13:

WIDIT 2005: Web HARD IR System

[System architecture diagram; labeled components:]
• Topics
• WordNet
• NLP Module
• Web
• CF
• Documents
• OSW Module
• WebX Module
• Indexing Module
• Inverted Index
• Synonym, Definition
• Noun Phrase, Web Terms, OSW Phrase
• Search Results
• Retrieval Module
• Fusion Module
• Automatic Tuning
• Baseline Result
• CF Terms
• Post-CF Result
• Re-ranking Module
• Final Result
• User

Page 14:

WIDIT 2006: Blog IR System
