
Introduction to Search kiyang/teaching/gSE/f18/lectures/2.gSE... · PDF file 2018-09-18 · Introduction to Search Engines. Search Engine Overview User. Intermediary. Information:


  • Introduction to Search Engines

  • Search Engine Overview

    User Intermediary Information

    What am I looking for? - Identification of info. need

    What question do I ask? - Query formulation

    What is the searcher looking for? - Discovery of user’s info. need

    How should the question be posed? - Query representation

    Where is the relevant information? - Query-document matching

    What data to collect? - Collection development

    What information to index? - Indexing/Representation

    How to represent it? - Data structure

    Search Engines 2

    Search flow: Query → Searchable Index → Search Results

    (0) Data

    1. Document Collection - e.g., spider/crawler

    2. Document Indexing - term indexing (tokenizing, stop & stem), term weighting

    Search

    (1) Query Indexing

    (2) Document Ranking

    (3) Result Display

  • Search Engine: Data

    Document Collection
    ► Select target data sources – e.g., domain, corpus, WWW
    ► Harvest data – e.g., data entry, data import, spider/crawler

    Document Indexing
    ► Select indexing sources (index terms) – e.g., metadata, keywords, content
    ► Extract indexing terms – e.g., tokenization, stop & stem
    ► Assign term weights – e.g., tf-idf, Okapi


    “The frequency of word occurrence in an article furnishes a useful measurement of word significance.” - The words that appear in a document can be used to analyze its content, and how often a word occurs is a measure of its importance as a subject term.

    Luhn, H.P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.
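The term-weighting step (tf-idf is one of the schemes named on this slide) can be sketched as follows. This is a minimal illustration, not the formula used in the lecture: it assumes raw term frequency multiplied by log(N / df), and the toy documents and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = raw tf * log(N / df), natural logarithm.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical, already tokenized documents.
docs = [
    ["search", "engine", "index"],
    ["search", "query", "query"],
    ["search", "engine", "ranking"],
]
weights = tf_idf(docs)
```

Under this variant a term that occurs in every document ("search" here) receives weight 0, while a term that is frequent in one document and rare elsewhere ("query") receives the largest weight, which is the intuition behind Luhn's observation.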

  • Search Engine: Indexing Process

    Documents (Text) → Tokenization → Tokens → Token Selection → Selected Tokens → Token Normalization → Sequential Index → Term Weighting → Inverted Index

    Documents (Text)
    D1: Information retrieval seminars
    D2: Retrieval Models and Information Retrieval
    D3: Information Model

    Tokens
    D1: information, retrieval, seminar(s)
    D2: retrieval, model(s), and, information, retrieval
    D3: information, model

    Sequential Index
    D1  information 1, retrieval 1, seminar 1
    D2  information 1, model 1, retrieval 2
    D3  information 1, model 1

    Inverted Index
                          D1  D2  D3
    wd1 (information)      1   1   1
    wd2 (model)            0   1   1
    wd3 (retrieval)        1   2   0
    wd4 (seminar)          1   0   0
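The indexing process for D1, D2, and D3 can be sketched in a few lines. This is only an illustration under stated assumptions: the stop list is a tiny made-up one, the "stemmer" merely strips a trailing plural "s", and the names `tokenize`, `stem`, and `index_terms` are hypothetical helpers rather than anything from the lecture.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"and", "the", "of", "a", "is", "what"}  # assumed tiny stop list

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Naive normalization (assumption): strip one trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def index_terms(text):
    """Tokenize, drop stop words, then normalize the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

docs = {
    "D1": "Information retrieval seminars",
    "D2": "Retrieval Models and Information Retrieval",
    "D3": "Information Model",
}

# Sequential index: document -> {term: frequency}.
sequential = {doc_id: Counter(index_terms(text)) for doc_id, text in docs.items()}

# Inverted index: term -> {document: frequency}.
inverted = defaultdict(dict)
for doc_id, counts in sequential.items():
    for term, tf in counts.items():
        inverted[term][doc_id] = tf
```

The sequential index is keyed by document and the inverted index by term, so a query term can be looked up directly without scanning every document.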

  • Search Engine: Search

    Query Indexing
    ► Tokenization
    ► Stop & Stem
    ► Term Weighting

    Document Ranking
    ► Query-Document matching
    ► Document Score computation

    Result Display
    ► Content - e.g., title & snippets
    ► Layout - e.g., grouped by category
    ► Toppings - e.g., related searches

    Query: What is information retrieval?
    Q: information 1, retrieval 1

    Index Term            D1  D2  D3
    wd1 (information)      1   1   1
    wd2 (model)            0   1   1
    wd3 (retrieval)        1   2   0
    wd4 (seminar)          1   0   0

    Rank  docID  score
    1     D2     3
    2     D1     2
    3     D3     1
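The slide does not state the scoring function, but one that is consistent with the ranking shown is the inner product of query and document term weights. The sketch below assumes that function; the name `rank` and the tie-breaking rule (by docID) are illustrative choices.

```python
# Inverted index from the slide: term -> {docID: term frequency}.
inverted_index = {
    "information": {"D1": 1, "D2": 1, "D3": 1},
    "model": {"D2": 1, "D3": 1},
    "retrieval": {"D1": 1, "D2": 2},
    "seminar": {"D1": 1},
}

def rank(query_weights, index):
    """Score each document by the inner product of query and document weights."""
    scores = {}
    for term, q_weight in query_weights.items():
        # Only documents in the term's postings can gain score from this term.
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + q_weight * d_weight
    # Highest score first; ties broken by docID (an assumption).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# "What is information retrieval?" after stop-word removal.
query = {"information": 1, "retrieval": 1}
ranking = rank(query, inverted_index)
# ranking == [("D2", 3), ("D1", 2), ("D3", 1)]
```

D2 ranks first because "retrieval" occurs twice in it, which matches the rank table on the slide.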

  • Example: Naver search results

    Query: 정보검색 (Information Retrieval)
    http://search.naver.com/search.naver?where=nexearch&query=%EC%A0%95%EB%B3%B4%EA%B2%80%EC%83%89

    Query: 검색엔진 (Search Engine)
    http://search.naver.com/search.naver?where=nexearch&query=%EA%B2%80%EC%83%89%EC%97%94%EC%A7%84
    https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EA%B2%80%EC%83%89%EC%97%94%EC%A7%84

    Result Categories
    1. Encyclopedia
    2. Naver Books
    3. Q&A DB (지식iN)
    4. Magazine
    5. Café
    6. Blog
    7. Book
    8. Map
    9. Website
    10. Advertisement (파워링크)
    11. Image
    12. Webpage
    13. Naver News Library
    14. Video
    15. Naver AppStore
    16. Naver Scholar
    17. Naver Post
    18. Naver Shopping
    19. News
    20. Naver Dictionary

    Characteristics
    ► Proprietary (Naver-specific) content
    ► Dynamic category order
    ► Toppings
      • Search by Category
      • Related Searches
      • Popular Searches (by category)

  • Example: Google search results

    Query: Information Retrieval
    https://www.google.co.kr/search?q=information+retrieval

    Query: Search Engine
    https://www.google.co.kr/search?q=search+engine&oq=search+engine&aqs=chrome..69i57j69i61l2j69i65l3.2111j0j8&sourceid=chrome&ie=UTF-8

    Result Categories
    1. Webpage
    2. Advertisement

    Characteristics
    ► Webpage-centric content
    ► Dynamic category order
    ► Toppings
      • Search by Category
      • Related Searches

  • Search Engine vs. Database vs. Directories

                         Search Engine                 Database                      Directories
    Corpus Type          General                       Specific                      General/Specific
    Data Collection      Automatic - crawler/spider    Manual - data entry/import    Manual - classification
    Data Quality         Not controlled                Controlled                    Controlled
    Data Organization    None (bag-of-words)           Structured - Relational       Structured - Hierarchical
    Query Input          Text box                      Field-specific - Boolean      Text box, Category Tree
    Search Result        Ranked - documents            Not ranked - records          Ranked - categories
    Search Index         Document text                 Database Tables               Category Tree
    e.g.                 Google                        Library Search                dmoz.org (http://dmoz-odp.org/)

  • WIDIT 2003: Web IR System

    System diagram components:

    Inputs: Documents, Topics
    Queries: Simple Queries, Phrase Queries
    Modules: Indexing Module, Retrieval Module, Fusion Module, Reranking Module
    Sub-indexes: Body Index, Anchor Index, Header Index
    Results: Search Results → Fusion Result → Final Result
    Tuning: System Training, Dynamic Tuning
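The WIDIT 2003 diagram names a Fusion Module that turns several sets of search results into a single fusion result, but the slide does not say how. One common approach is a weighted sum of normalized scores; the sketch below assumes that approach, and the per-index scores, the weights, and the names `min_max` and `fuse` are all hypothetical.

```python
def min_max(scores):
    """Rescale one run's scores to [0, 1] so different runs are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # A run whose scores are all equal carries no ranking signal; map to 1.0.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def fuse(runs, weights):
    """Weighted-sum fusion (assumed formula): sum over runs of weight * normalized score."""
    fused = {}
    for name, scores in runs.items():
        for doc_id, s in min_max(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[name] * s
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical retrieval scores from the body, anchor, and header sub-indexes.
runs = {
    "body": {"D1": 2.0, "D2": 4.0, "D3": 1.0},
    "anchor": {"D2": 1.0, "D3": 3.0},
    "header": {"D1": 5.0},
}
weights = {"body": 0.5, "anchor": 0.25, "header": 0.25}  # assumed fusion weights
fused = fuse(runs, weights)
```

In a design like this, the fusion weights are the kind of parameter that a training or tuning step would adjust.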

  • WIDIT 2004: Web IR w/ Query Classification

    System diagram components:

    Inputs: Documents, Topics
    Queries: Simple Queries, Expanded Queries
    Modules: Indexing Module, Query Classification Module, Retrieval Module, Fusion Module, Re-ranking Module
    Sub-indexes: Body Index, Anchor Index, Header Index
    Query Classification Module output: Query Types
    Results: Search Results → Fusion Result → Final Result
    Tuning: Static Tuning, Dynamic Tuning

  • WIDIT 2004: Dynamic Tuning


  • WIDIT 2005: Web HARD IR System

    System diagram components:

    Inputs: Topics, Documents, WordNet, Web, CF, User
    Modules: NLP Module, OSW Module, WebX Module, Indexing Module, Retrieval Module, Fusion Module, Re-ranking Module
    Index: Inverted Index
    Query terms: Synonym, Definition, Noun Phrase, Web Terms, OSW Phrase, CF Terms
    Results: Search Results, Baseline Result, Post-CF Result, Final Result
    Tuning: Automatic Tuning

  • WIDIT 2006: Blog IR System
