Chapter 4 Query Language, Hsin-Hsi Chen, Department of Computer Science and Information Engineering, National Taiwan University


Hsin-Hsi Chen 4-1

Chapter 4 Query Language

Hsin-Hsi Chen

Department of Computer Science and Information Engineering

National Taiwan University


Hsin-Hsi Chen 4-2

Introduction

• Goals
  – Which queries can be formulated
  – How the formulation is related to the underlying information retrieval models
• Query languages


Hsin-Hsi Chen 4-3

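[Figure: types of queries covered in this chapter. Labels recovered from the diagram: basic queries (keywords and context): words, phrases, proximity, Boolean queries, fuzzy Boolean, natural language; pattern matching: errors, substrings, prefixes, suffixes, regular expressions, extended patterns; structured queries.]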


Hsin-Hsi Chen 4-4

Keyword-Based Querying

• single-word queries
  – A query is formulated by a word
  – A document is formulated by long sequences of words
  – A word is a sequence of letters surrounded by separators
  – What are letters and separators?
    • e.g., ‘on-line’
  – Chinese sentences are composed of characters without word boundaries
  – The division of the text into words is not arbitrary (this topic will be dealt with in a special talk on Chinese IR)
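The question "what are letters and separators?" can be made concrete with a small sketch. The function name `tokenize` and the `extra_letters` parameter are hypothetical, chosen only for illustration; the point is that ‘on-line’ becomes one word or two depending on whether the hyphen counts as a letter.

```python
import re

def tokenize(text, extra_letters=""):
    """Split text into words: maximal runs of 'letters'.

    Assumption for this sketch: letters are ASCII letters and digits,
    plus any characters listed in extra_letters. Everything else is
    treated as a separator.
    """
    letter_class = "[A-Za-z0-9" + re.escape(extra_letters) + "]+"
    return re.findall(letter_class, text.lower())

# Hyphen as a separator: 'on-line' splits into two words.
print(tokenize("On-line retrieval"))
# Hyphen as a letter: 'on-line' stays a single word.
print(tokenize("On-line retrieval", extra_letters="-"))
```

A query for ‘on-line’ only matches if the text and the query were split under the same convention.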


Hsin-Hsi Chen 4-5

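The Word Segmentation Problem (斷詞問題)

• Problem
  – Chinese sentences have no explicit delimiters between words.
  – Example: 這名記者會說國語。
    • 這 名 記者 會 說 國語。 ("This reporter can speak Mandarin.")
    • 這 名 記者會 說 國語。 (takes 記者會, "press conference", as one word)
• Definition of a word
  – A string that has an independent meaning and plays a specific syntactic role should be treated as one word.
• Segmentation standards
  – Mainland China: 信息處理用現代漢語分詞規範 (Contemporary Chinese Word Segmentation Specification for Information Processing)
    • drafted in 1989
    • submitted as a national standard in 1993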


Hsin-Hsi Chen 4-6

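The Word Segmentation Problem (Continued)

  – Taiwan: 資訊處理用中文分詞標準草案 (Draft Chinese Word Segmentation Standard for Information Processing)
    • drafted in 1996 by the Computational Linguistics Society of the R.O.C. (中華民國計算語言學學會)
    • Basic principles
      – A string whose meaning cannot be obtained by simply adding up the meanings of its components should be treated as one segmentation unit, e.g., 撞期 ("schedule conflict") vs. 撞山 ("hit a mountain")
      – A string whose part of speech cannot be derived directly from its components should be merged into one segmentation unit, e.g., 好喝 ("good to drink")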


Hsin-Hsi Chen 4-7

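Processing Approach (處理模式)

• A dictionary is an indispensable resource
  – List "all" possible words
    • 把他的確實行動作了分析 ("analyzed his actual actions"): 把, 他, 的, 確實, 實行, 行動, 動作, 了, 分析
    • 電子計算機是會計算題目的機器 ("an electronic computer is a machine that can work out problems"): 電子, 計算, 計算機, 電子計算機, 是, 會, 會計, 計算, 計算題, 題目, 目的, 的, 機器
  – word lattice
    電 子 計 算 機 是 會 計 算 題 目 的 機 器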
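Listing "all" possible words amounts to collecting every dictionary entry that occurs at every position of the sentence. The sketch below builds those lattice edges for the second example sentence; the function name `word_lattice`, the `max_len` bound, and the tiny dictionary (just the words listed on the slide) are assumptions for illustration.

```python
def word_lattice(sentence, dictionary, max_len=5):
    """Return (start, end, word) edges for every dictionary word in the sentence.

    Each edge spans sentence[start:end]; the set of edges is the word lattice.
    """
    edges = []
    for start in range(len(sentence)):
        longest = min(len(sentence), start + max_len)
        for end in range(start + 1, longest + 1):
            candidate = sentence[start:end]
            if candidate in dictionary:
                edges.append((start, end, candidate))
    return edges

# Toy dictionary: only the words listed for this sentence.
dictionary = {"電子", "計算", "計算機", "電子計算機", "是", "會",
              "會計", "計算題", "題目", "目的", "的", "機器"}
sentence = "電子計算機是會計算題目的機器"

for start, end, word in word_lattice(sentence, dictionary):
    print(start, end, word)
```

A segmentation is then a path through the lattice that covers the sentence from the first character to the last without gaps or overlaps.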


Hsin-Hsi Chen 4-8

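Processing Approach (Continued)

• Disambiguation mechanism
  – Pick the best combination
  – Strategies
    • Rule-based
      – Longest word first: 台灣大學 是 有名 的 學府 ("National Taiwan University is a famous institution")
        A long word can mask shorter words: 這 名 記者 會 說 國語。 (the dictionary word 記者會 hides the correct 記者 + 會)
      – Remove word segments that break the path
      – Heuristics: prefer three-character words, ...
      – Parser
    • Statistical
      – Markov models, relaxation methods, ...
  – Performance: every group claims an accuracy above 95%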
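The longest-word-first rule can be sketched as forward maximum matching. This is a minimal illustration with a toy dictionary, not any particular system's implementation; the name `max_match` and the fallback to single characters are assumptions. It reproduces both slide examples: the correct split of the 台灣大學 sentence and the masking error on 記者會.

```python
def max_match(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation that always takes the longest dictionary word.

    A single character is accepted as a fallback when no dictionary word
    starts at the current position.
    """
    words = []
    i = 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in dictionary or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

# Toy dictionary, assumed for illustration.
dictionary = {"台灣", "大學", "台灣大學", "是", "有名", "的", "學府",
              "這", "名", "記者", "記者會", "會", "說", "國語"}

print(max_match("台灣大學是有名的學府", dictionary))
print(max_match("這名記者會說國語", dictionary))
```

The second call returns 記者會 as one word, which is the masking problem noted above: a purely greedy rule cannot recover 記者 + 會 without extra evidence.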


Hsin-Hsi Chen 4-9

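Processing Approach (Continued)

• The core problem
  – Does the dictionary contain every possible word?
    • e.g., A錢 (slang, "to embezzle money"), 凍蒜 (from Taiwanese, "to get elected")
  – Strategies
    • Word-formation rules
    • (Semi-)automatic construction of new dictionaries
    • Unknown-word handling models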


Hsin-Hsi Chen 4-10

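Word-Formation Rules (構詞率)

• Formation of numerals and measure words
  – 一個個, 一條條
• Dates and times
  – 八十五年十月四日 (October 4 of year 85)
• Prefixes or suffixes of nouns or verbs
  – 學生們 ("students", with the plural suffix 們)
• Special verbs
  – 丟丟看, 吃吃看, 寫寫看 (verb repeated and followed by 看, "try doing ...")
  – 高高興興, 歡歡喜喜, 漂漂亮亮, 迷迷糊糊
  – 打打球, 跑跑步, 寫寫字
• ...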
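Several of the special-verb patterns above are reduplications, which can be recognized without a dictionary entry by using regular-expression backreferences. This is a rough sketch only: the pattern labels ("V-V-看", "AABB", "AAB") and the function `classify` are hypothetical names introduced here, and a real rule set would also need to check that the repeated character is a verb or adjective.

```python
import re

# Order matters: the more specific V-V-看 pattern is tried before generic AAB.
PATTERNS = [
    ("V-V-看", re.compile(r"(.)\1看")),    # e.g. 吃吃看
    ("AABB", re.compile(r"(.)\1(.)\2")),   # e.g. 高高興興
    ("AAB", re.compile(r"(.)\1.")),        # e.g. 打打球
]

def classify(word):
    """Return the label of the first reduplication pattern matching the whole word."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(word):
            return label
    return None

for w in ["吃吃看", "高高興興", "打打球", "學生們"]:
    print(w, classify(w))
```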


Hsin-Hsi Chen 4-11

Context Queries

• definition
  – Search words in a given context, e.g., near other words
• types
  – phrase
    • a sequence of single-word queries
    • e.g., enhance retrieval
  – proximity
    • a sequence of single words or phrases, and a maximum allowed distance between them are specified
    • e.g., within distance(enhance, retrieval, 4) will match ‘… enhance the power of retrieval …’
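Both context query types can be sketched over a list of word tokens. The assumptions here are that distance is counted in words and that the first term must precede the second; the slide leaves both open, so they are choices of this sketch. The function names are hypothetical.

```python
def phrase_match(tokens, phrase):
    """True if the words of the phrase occur consecutively in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def within_distance(tokens, first, second, max_dist):
    """True if `second` occurs at most max_dist word positions after `first`."""
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    return any(0 < j - i <= max_dist for i in pos_first for j in pos_second)

tokens = "enhance the power of retrieval".split()
print(within_distance(tokens, "enhance", "retrieval", 4))  # distance is exactly 4 words
print(within_distance(tokens, "enhance", "retrieval", 3))
print(phrase_match(tokens, ["enhance", "retrieval"]))      # not adjacent here
```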


Hsin-Hsi Chen 4-12

Boolean Queries

• definition
  – A syntax composed of atoms that retrieve documents, and of Boolean operators which work on their operands
  – e.g., translation AND (syntax OR syntactic)

Query syntax tree:

            AND
           /   \
  translation   OR
               /  \
          syntax  syntactic


Hsin-Hsi Chen 4-13

Boolean Queries (Continued)

• operators
  – (e1 OR e2)
    • Select all documents which satisfy e1 or e2. Duplicates are eliminated.
  – (e1 AND e2)
    • Select all documents which satisfy both e1 and e2.
  – (e1 BUT e2)
    • Select all documents which satisfy e1 but not e2.
• “fuzzy Boolean”
  – Retrieve documents appearing in some operands (the AND may require a document to appear in more operands than the OR)
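The three operators map directly onto set operations when each atom returns a set of document identifiers. The sketch below evaluates the example query translation AND (syntax OR syntactic); the inverted index contents and the helper names are made up for illustration.

```python
# Hypothetical inverted index: word -> set of document ids containing it.
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}

def atom(word):
    """An atom retrieves the documents containing the word."""
    return index.get(word, set())

def OR(e1, e2):
    return e1 | e2   # union; sets eliminate duplicates automatically

def AND(e1, e2):
    return e1 & e2   # intersection

def BUT(e1, e2):
    return e1 - e2   # documents satisfying e1 but not e2

result = AND(atom("translation"), OR(atom("syntax"), atom("syntactic")))
print(sorted(result))
print(sorted(BUT(atom("translation"), atom("syntax"))))
```

Evaluation follows the query syntax tree bottom-up: the OR node is computed first, then combined with the left atom by AND.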


Hsin-Hsi Chen 4-14

Natural Language

• generalization of “fuzzy Boolean”

• A query is an enumeration of words and context queries.

• All the documents matching a portion of the user query are retrieved.
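One simple reading of "matching a portion of the user query" is to score each document by the fraction of query words it contains and return every document with a non-zero score. This is only a sketch under that assumption; the function `rank` and the sample documents are hypothetical, and real systems use more refined weighting.

```python
def rank(query_words, docs):
    """Return (doc_id, score) pairs, best first, for documents matching any query word.

    The score is the fraction of distinct query words present in the document.
    """
    query = set(query_words)
    scores = {}
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        matched = len(query & words)
        if matched:
            scores[doc_id] = matched / len(query)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    "d1": "query languages for text retrieval",
    "d2": "structured text",
    "d3": "cooking recipes",
}
print(rank(["text", "retrieval", "query"], docs))
```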


Hsin-Hsi Chen 4-15

Pattern Matching

• A pattern is a set of syntactic features that must occur in a text segment

• types
  – words
  – prefixes, e.g., ‘comput’ → ‘computer’, ‘computation’, ‘computing’, etc.
  – suffixes, e.g., ‘ters’ → ‘computers’, ‘testers’, ‘painters’, etc.
  – substrings, e.g., ‘tal’ → ‘coastal’, ‘talk’, ‘metallic’, etc.
  – ranges (lexicographic order), e.g., between ‘held’ and ‘hold’ → ‘hoax’ and ‘hissing’
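The four pattern types can be tried against a small vocabulary with plain string operations. The vocabulary is assembled from the words in the examples above; the function names are hypothetical, and the range is taken as exclusive of its endpoints, which is an assumption of this sketch.

```python
vocab = ["coastal", "computation", "computer", "computers", "computing",
         "held", "hissing", "hoax", "hold", "metallic", "painters",
         "talk", "testers"]

def by_prefix(words, prefix):
    return [w for w in words if w.startswith(prefix)]

def by_suffix(words, suffix):
    return [w for w in words if w.endswith(suffix)]

def by_substring(words, sub):
    return [w for w in words if sub in w]

def by_range(words, low, high):
    # Python compares strings lexicographically by code point,
    # which matches dictionary order for lowercase ASCII words.
    return [w for w in words if low < w < high]

print(by_prefix(vocab, "comput"))
print(by_suffix(vocab, "ters"))
print(by_substring(vocab, "tal"))
print(by_range(vocab, "held", "hold"))
```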


Hsin-Hsi Chen 4-16

Pattern Matching (Continued)

  – allowing errors
    • Retrieve all text words which are ‘similar’ to the given word
    • edit distance: the minimum number of character insertions, deletions, and replacements needed to make two strings equal, e.g., ‘flower’ and ‘flo wer’ are at edit distance 1 (one inserted space)
    • maximum allowed edit distance: the query specifies the maximum number of allowed errors for a word to match the pattern
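Edit distance as defined above is usually computed with the standard dynamic-programming recurrence. The sketch below keeps only one previous row of the table; `similar_words` is a hypothetical helper showing how a maximum allowed number of errors filters a vocabulary.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                          # deleting the first i characters of a
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # replace, or keep if equal
            ))
        prev = curr
    return prev[-1]

def similar_words(vocab, word, max_errors):
    """Words of vocab within max_errors edits of word."""
    return [w for w in vocab if edit_distance(w, word) <= max_errors]

print(edit_distance("flower", "flo wer"))
print(similar_words(["flower", "flowers", "power", "tower"], "flower", 1))
```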


Hsin-Hsi Chen 4-17

Pattern Matching (Continued)

  – regular expressions
    • union: if e1 and e2 are regular expressions, then (e1 | e2) matches what e1 or e2 matches
    • concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are formed by the occurrences of e1 immediately followed by those of e2
    • repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e
    • e.g., ‘pro (blem | tein) (s | ε) (0 | 1 | 2)*’ matches ‘problem2’ and ‘proteins’
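The example expression can be checked with Python's `re` module, written without the spaces used for readability on the slide and with the empty alternative left blank. This is a small sketch; `matches` is a hypothetical helper that requires the whole word to match.

```python
import re

# pro, then 'blem' or 'tein', then an optional 's', then any run of the digits 0, 1, 2.
pattern = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")

def matches(word):
    """True if the entire word is an occurrence of the pattern."""
    return pattern.fullmatch(word) is not None

for word in ["problem2", "proteins", "proteins012", "program", "problem3"]:
    print(word, matches(word))
```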


Hsin-Hsi Chen 4-18

Pattern Matching (Continued)

  – extended patterns
    • subsets of the regular expressions expressed with a simpler syntax
    • classes of characters
    • conditional expressions
    • wild characters which match any sequence in the text
    • combinations
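Shell-style wildcards are one familiar example of a simpler syntax covering a subset of regular expressions. The sketch below uses Python's standard `fnmatch` module as a stand-in to illustrate character classes and wild characters; it is not the extended pattern language of any specific retrieval system, and it has no counterpart for conditional expressions.

```python
import fnmatch

words = ["cat", "hat", "bat", "computer", "computing", "compute"]

# Class of characters: [ch] matches exactly one of 'c' or 'h'.
print(fnmatch.filter(words, "[ch]at"))

# Wild character: * matches any sequence of characters, including none.
print(fnmatch.filter(words, "comput*"))

# Combination: a wild character followed by a fixed suffix.
print(fnmatch.filter(words, "*ing"))

# Each simple pattern corresponds to an ordinary regular expression.
print(fnmatch.translate("comput*"))
```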


Hsin-Hsi Chen 4-19

Structural Queries

• mixing contents and structure in queries
  – contents: words, phrases, or patterns
  – structural constraints: containment, proximity, or other restrictions on structural elements
• issues
  – what structure a text may have
  – what queries can be made on which structures
• three main structures
  – form-like fixed structure
  – hypertext structure
  – hierarchical structure


Hsin-Hsi Chen 4-20

Form-like fixed structure

• Document: a fixed set of fields
• For example, a mail has a sender, a receiver, a date, a subject and a body field.
• Search for the mails sent to a given person with “football” in the Subject field

[Figure: a document divided into fields, each field holding text]
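The mail example can be sketched with one dictionary per document, where the keys are the fixed fields. The sample mails, names, and the function `search_mails` are invented for illustration.

```python
# Each document has the same fixed set of fields.
mails = [
    {"sender": "alice", "receiver": "bob", "date": "1999-05-01",
     "subject": "Football on Sunday", "body": "Are you coming?"},
    {"sender": "carol", "receiver": "bob", "date": "1999-05-02",
     "subject": "Meeting notes", "body": "We talked about football."},
    {"sender": "bob", "receiver": "alice", "date": "1999-05-03",
     "subject": "Re: football", "body": "Yes."},
]

def search_mails(mails, receiver, subject_word):
    """Mails sent to `receiver` whose Subject field contains `subject_word`."""
    word = subject_word.lower()
    return [m for m in mails
            if m["receiver"] == receiver
            and word in m["subject"].lower().split()]

for mail in search_mails(mails, "bob", "football"):
    print(mail["sender"], mail["subject"])
```

The second mail mentions football only in its body, so restricting the match to the Subject field excludes it.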


Hsin-Hsi Chen 4-21

Hypertext structure

• A hypertext is a directed graph where nodes hold some text (text contents)
• The links represent connections between nodes or between positions inside nodes (structural connectivity)
• WebGlimpse: combine browsing and searching on the Web


Hsin-Hsi Chen 4-22

WebGlimpse (http://glimpse.cs.arizona.edu/webglimpse/index.html)

• WebGlimpse is a fast, flexible search engine for finding information in a related web of pages.

• The ability to index pages on remote sites provides a level of power one step above most search engine tools.

• You can define your own sub-area of the web simply by making a page of links to all relevant sites.

• WebGlimpse will search by following your links, to whatever 'depth' you specify.
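The idea of searching a neighborhood of linked pages up to a given depth can be sketched as a breadth-first traversal of the link graph. This is only an illustration of the concept and makes no claim about how WebGlimpse is implemented; the page contents, link table, and function name are hypothetical.

```python
from collections import deque

def search_linked_pages(pages, links, start, depth, word):
    """Pages reachable from `start` in at most `depth` link hops that contain `word`."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        page, hops = queue.popleft()
        if word in pages.get(page, "").lower().split():
            hits.append(page)
        if hops < depth:                      # do not follow links beyond the limit
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, hops + 1))
    return hits

# Toy hypertext: node text (contents) and directed links (connectivity).
pages = {"home": "my links", "a": "query languages",
         "b": "boolean query models", "c": "structural query"}
links = {"home": ["a", "b"], "b": ["c"]}

print(search_linked_pages(pages, links, "home", 1, "query"))
print(search_linked_pages(pages, links, "home", 2, "query"))
```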


Hsin-Hsi Chen 4-23

Hierarchical Structure

Recursive decomposition of the text


Hsin-Hsi Chen 4-24

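[Figure: example of a hierarchical structure, in three parts.
 Text: "Chapter 4 / 4.1 Introduction / We cover in this chapter the different kinds of … / … / 4.4 Structural Queries / …"
 Hierarchy: a chapter node with section nodes below it; the sections contain title and figure nodes, over the text "Introduction", "We cover …", "Structural …".
 Query: figure in section with title with “structural”.]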
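The query in the figure, a figure inside a section whose title contains “structural”, can be evaluated by walking a tree of nested nodes. The node layout (dictionaries with `type`, `text`, and `children`), the sample tree, and the function names are assumptions of this sketch.

```python
def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)

def find(node, node_type):
    return [n for n in walk(node) if n["type"] == node_type]

def figures_in_sections_titled(root, word):
    """Figures contained in a section that has a title containing `word`."""
    results = []
    for section in find(root, "section"):
        titles = [c for c in section.get("children", []) if c["type"] == "title"]
        if any(word.lower() in t.get("text", "").lower() for t in titles):
            results.extend(find(section, "figure"))
    return results

# Hypothetical document tree loosely following the chapter in the figure.
chapter = {"type": "chapter", "children": [
    {"type": "section", "children": [
        {"type": "title", "text": "Introduction"},
        {"type": "paragraph", "text": "We cover ..."},
    ]},
    {"type": "section", "children": [
        {"type": "title", "text": "Structural Queries"},
        {"type": "figure", "text": "Figure A"},
    ]},
]}

print([f["text"] for f in figures_in_sections_titled(chapter, "structural")])
```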


Hsin-Hsi Chen 4-25

Issues

• static or dynamic structure
  – static: there are one or more explicit hierarchies
  – dynamic: the required elements are built on the fly using text makeup
• restrictions on the structure
  – The text or the answers may have restrictions about nesting and/or overlapping


Hsin-Hsi Chen 4-26

Issues (Continued)

• integration with text
  – integration of queries on text content with queries on text structure
• query language
  – features
    • selection of areas that contain (or not) other areas
    • selection of areas that are contained (or not) in other areas
    • selection of areas that follow (or are followed by) other areas
    • selection of areas that are close to other areas
    • set manipulation
  – standardization, expressiveness taxonomy or formal categorization
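The selection features listed above can be sketched by modelling each area as a half-open interval of text positions. The `Area` type, the helper names, and the sample positions are assumptions for illustration and do not correspond to any of the hierarchical models named in this chapter.

```python
from typing import NamedTuple

class Area(NamedTuple):
    name: str
    start: int   # first position covered
    end: int     # one past the last position covered

def contains(outer, inner):
    return outer != inner and outer.start <= inner.start and inner.end <= outer.end

def follows(a, b):
    """True if area a starts at or after the end of area b."""
    return a.start >= b.end

def close_to(a, b, max_gap):
    """True if the gap between the two areas is at most max_gap positions."""
    gap = max(a.start - b.end, b.start - a.end, 0)
    return gap <= max_gap

def select_containing(areas, others):
    return [a for a in areas if any(contains(a, o) for o in others)]

def select_contained_in(areas, others):
    return [a for a in areas if any(contains(o, a) for o in others)]

sections = [Area("sec1", 0, 100), Area("sec2", 100, 250)]
figures = [Area("fig1", 120, 150)]

print(select_containing(sections, figures))     # sections that contain a figure
print(select_contained_in(figures, sections))   # figures inside some section
print(follows(sections[1], sections[0]))
print(close_to(figures[0], sections[0], 20))
```

The "or not" variants are the complements of these selections, and set manipulation applies ordinary union, intersection, and difference to the resulting lists of areas.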


Hsin-Hsi Chen 4-27

A Sample of Hierarchical Models

• PAT Expressions

• Overlapped Lists

• Proximal Nodes

• Tree Matching