Query Languages



Page 1: Query Languages

Query Languages

Page 2: Query Languages

Information Retrieval

Concerned with the
• Representation of
• Storage of
• Organization of, and
• Access to
information items.

Page 3: Query Languages

Recap 1: Important points

– Data retrieval vs. information retrieval
– Users specify their needs via an intermediary language.
– Documents are represented by an abstraction of their content.
– Traditional model vs. berry-picking model
– Evaluation (precision/recall, single-value measures, human measures)
– Task characteristics (question answering, open-ended analysis, ad-hoc vs. filtering)
– Collection characteristics (size, document relations)

Page 4: Query Languages

Recap 2: Important points

– Content types (text, descriptive/semantic metadata, multimedia)
– Metadata (formats & sets)
– Information Theory (entropy)
– Models of symbol distribution (Zipf’s law, Heaps’ law)
– Distance measures (Hamming distance, Levenshtein distance)
– Markup languages (SGML, HTML, XML)
– Multimedia formats (header + data)

Page 5: Query Languages

Query Languages

The query language determines which queries can be formulated
– User-oriented languages
– System-oriented languages (protocols)

The language depends on the underlying information retrieval model

Systems may enhance the query by
– Word expansion (thesaurus & stemming)
– Removing stopwords
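Query enhancement can be illustrated with a minimal sketch. The stopword list, the thesaurus entries, and the function name `enhance` are all assumptions made up for this example; stemming is left out because it needs a real stemmer.

```python
# Illustrative word lists (assumed); a real system would load much larger ones.
STOPWORDS = {"the", "of", "a", "an", "in", "for"}
THESAURUS = {"car": ["automobile", "vehicle"]}

def enhance(query):
    """Drop stopwords, then append thesaurus entries after each remaining term."""
    expanded = []
    for term in query.lower().split():
        if term in STOPWORDS:
            continue
        expanded.append(term)
        expanded.extend(THESAURUS.get(term, []))
    return expanded

print(enhance("the price of a car"))  # ['price', 'car', 'automobile', 'vehicle']
```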

Page 6: Query Languages

Keyword-Based Querying

A query is composed of keywords
– Documents containing the keywords are returned
– Intuitive, easy to express, fast ranking
– Single-word and multi-word queries

Classification of keyword-based queries
– single-word queries
– context queries
– Boolean queries
– natural language queries

Page 7: Query Languages

Single-Word Queries

Word query
– the most elementary query that can be formulated
– a word is a sequence of letters surrounded by separators
– in many models, single words are the only type of query allowed

Result of word queries
– the set of documents containing at least one of the words of the query
– the resulting documents are ranked according to their degree of similarity to the query
  • uses term frequency and inverse document frequency
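The ranking step can be sketched as follows. This is one simple tf-idf weighting (raw term frequency times log(N / document frequency)), chosen for illustration; the slides do not fix a formula, and the names `tokenize` and `rank` are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # A word is a maximal run of letters; anything else is a separator.
    return re.findall(r"[a-z]+", text.lower())

def rank(query, docs):
    """Return (doc_id, score) pairs for docs containing at least one query word."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for toks in tokenized for w in set(toks))
    query_words = set(tokenize(query))
    results = []
    for doc_id, toks in enumerate(tokenized):
        tf = Counter(toks)
        matched = [w for w in query_words if tf[w] > 0]
        if not matched:
            continue  # document contains none of the query words
        score = sum(tf[w] * math.log(n / df[w]) for w in matched)
        results.append((doc_id, score))
    # Highest score first; ties broken by document id.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))

docs = [
    "information retrieval systems",
    "database systems",
    "retrieval of information and retrieval models",
]
print([doc_id for doc_id, _ in rank("retrieval", docs)])  # [2, 0]
```

With this weighting, a word that occurs in every document gets weight zero, so such a document is still returned but does not rise in the ranking.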

Page 8: Query Languages

Context Queries

Phrase query
– a sequence of single-word queries
– ignores separators and uninteresting words
  example: “…enhance the retrieval…”
– ranked in a fashion analogous to single words

Proximity query
– a phrase query with a maximum allowed distance (in characters or words) between the words in the query
  example: distance = 4, “enhance the power of retrieval”
– physical proximity has semantic value: words in the same paragraph are related in some way
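A minimal sketch of both context queries, under stated assumptions: distance is measured in words as the difference between word positions, terms must appear in query order, and the "uninteresting words" are a small made-up stopword list. The function names are hypothetical.

```python
STOPWORDS = {"the", "of", "a", "an"}  # assumed list of uninteresting words

def words_of(text, drop_stopwords=False):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if drop_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words

def proximity_match(words, terms, max_distance):
    """True if the terms occur in order, each at most max_distance
    word positions after the previous one."""
    def search(term_index, prev_pos):
        if term_index == len(terms):
            return True
        for pos, word in enumerate(words):
            if word != terms[term_index]:
                continue
            if prev_pos is None or 0 < pos - prev_pos <= max_distance:
                if search(term_index + 1, pos):
                    return True
        return False
    return search(0, None)

def phrase_match(text, phrase):
    # A phrase is a proximity query with distance 1 once stopwords are ignored.
    return proximity_match(words_of(text, True), words_of(phrase, True), 1)

text = "enhance the power of retrieval"
print(proximity_match(words_of(text), ["enhance", "retrieval"], 4))  # True
print(proximity_match(words_of(text), ["enhance", "retrieval"], 3))  # False
print(phrase_match("methods to enhance the retrieval", "enhance retrieval"))  # True
```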

Page 9: Query Languages

Boolean Queries

Boolean query
– composed of atoms (basic queries) that retrieve documents, and of Boolean operators that work on their operands (sets of documents)

Query syntax tree
– compositional scheme
– leaves: basic queries
– internal nodes: operators

[Figure: query syntax tree for translation AND (syntax OR syntactic)]

Page 10: Query Languages

Boolean Queries

Operators in Boolean queries
– e1 OR e2: selects all docs satisfying e1 or e2
– e1 AND e2: selects all docs satisfying both e1 and e2
– e1 BUT e2: selects all docs satisfying e1 but not e2

Classic Boolean system
– no ranking of the retrieved docs
– does not allow partial matching
– alternative: a fuzzy Boolean set of operators
  • the meaning of AND and OR can be relaxed (e.g., retrieving docs that appear in only some of the operands)
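Because each operand is a set of documents, a Boolean query can be evaluated bottom-up over its syntax tree with plain set operations. The sketch below assumes a tree encoded as nested tuples `(operator, left, right)` with words as leaves, and a toy inverted index; both are illustrative choices, not a prescribed representation.

```python
def build_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def evaluate(node, index):
    """Evaluate a query syntax tree: leaves are words, internal nodes operators."""
    if isinstance(node, str):
        return index.get(node, set())
    operator, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if operator == "AND":
        return a & b
    if operator == "OR":
        return a | b
    if operator == "BUT":
        return a - b  # satisfies the left operand but not the right
    raise ValueError(f"unknown operator: {operator}")

docs = [
    "translation syntax",
    "translation semantics",
    "syntactic analysis",
    "syntactic translation",
]
index = build_index(docs)
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(sorted(evaluate(query, index)))                              # [0, 3]
print(sorted(evaluate(("BUT", "translation", "syntax"), index)))   # [1, 3]
```

The result is an unranked set, which matches the classic Boolean behaviour described above: a document either satisfies the query or it does not.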

Page 11: Query Languages

Natural Language

Natural language query
– blurs the distinction between AND and OR → the query becomes an enumeration of words and context queries
– a higher ranking is assigned to documents matching more parts of the query

Characteristics
– retrieves all the documents close to the query
– a complete document can be used as a query → leads to the use of relevance feedback techniques (the user selects a document from the result and submits it as a new query)
– example system: AskJeeves

Page 12: Query Languages

Query Languages: Patterns & Structures

Page 13: Query Languages

Pattern Matching

Pattern
– a set of syntactic features that must occur in a text segment

Types of patterns
– Words: a string (sequence of characters) that must be a word in the text
– Prefixes: a string that forms the beginning of a text word (e.g., comput → computer, computation)
– Suffixes: a string that forms the end of a text word (e.g., ters → computers, painters)
– Substrings: a string that appears within a text word; word separators are allowed (e.g., tal → talk, metallic; any flow → many flowers)
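These pattern types can be sketched against a small vocabulary. The word list and function names are assumptions for illustration; the prefix search uses binary search on a sorted vocabulary because all words sharing a prefix are contiguous in lexicographical order.

```python
import bisect

VOCAB = sorted(["compile", "computation", "computer", "computers",
                "metallic", "painters", "talk"])

def prefix_matches(sorted_vocab, prefix):
    """Words starting with prefix, found from the first candidate position."""
    start = bisect.bisect_left(sorted_vocab, prefix)
    matches = []
    for word in sorted_vocab[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

def suffix_matches(vocab, suffix):
    return [w for w in vocab if w.endswith(suffix)]

def substring_matches(vocab, fragment):
    return [w for w in vocab if fragment in w]

print(prefix_matches(VOCAB, "comput"))   # ['computation', 'computer', 'computers']
print(suffix_matches(VOCAB, "ters"))     # ['computers', 'painters']
print(substring_matches(VOCAB, "tal"))   # ['metallic', 'talk']
# A substring pattern may span a word separator when matched against the text itself:
print("any flow" in "many flowers")      # True
```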

Page 14: Query Languages

Pattern Matching (cont.)

More Types of Patterns
– Ranges: a pair of strings that matches any word lying between them in lexicographical order
  (e.g., held to hold → hoax, hissing)
– Allowing errors: retrieves words similar to a given word
  (e.g., flower → flo wer [edit distance = 1])
– Regular expressions: general patterns built up from simple strings and operators
  (e.g., pro (blem | tein) (s | ε) (0 | 1 | 2)* → problem02, proteins)
– Extended patterns: classes of characters, conditional expressions, wild characters, combinations
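The three examples above can be checked with a short sketch. The range test is assumed to be inclusive at both ends, the error measure is the Levenshtein (edit) distance, and the regular expression is a direct transcription of the slide's pattern into Python `re` syntax; the function names are hypothetical.

```python
import re

def in_range(word, low, high):
    """Range pattern: word lies between low and high in lexicographical order."""
    return low <= word <= high

def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute or match
            ))
        previous = current
    return previous[-1]

# pro (blem | tein) (s | ε) (0 | 1 | 2)*
PATTERN = re.compile(r"pro(blem|tein)s?[012]*")

print(in_range("hoax", "held", "hold"), in_range("hissing", "held", "hold"))  # True True
print(edit_distance("flower", "flo wer"))                                     # 1
print(bool(PATTERN.fullmatch("problem02")), bool(PATTERN.fullmatch("proteins")))  # True True
```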

Page 15: Query Languages

Structural Queries

Structural query
– mixes content and structure in queries
  • content constraints (words, phrases, patterns)
  • structural constraints (containment, proximity) and restrictions on structural elements (chapters, sections)

Types of text structure
– form-like fixed structure
– hypertext structure
– hierarchical structure

Page 16: Query Languages

Structural Queries (cont.)

[Figure: the three types of structure: form, hypertext, hierarchical]

Page 17: Query Languages

Fixed Structure

Traditional restrictions
– documents had a fixed set of fields
– each field had some text inside
– only rarely could the fields appear in any order or repeat
– fields were not allowed to nest or overlap
– retrieval: specifying a given basic pattern to be found only in a given field

Characteristics
– reasonable for retrieving from text collections that have a fixed structure (e.g., a mail archive) → inadequate for representing hierarchical structure such as that of HTML docs
– expansion to the relational DB model
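Field-restricted retrieval over a fixed structure can be sketched with a toy mail archive. The records, field names, and the function `search_field` are invented for illustration; the basic pattern here is a case-insensitive substring.

```python
# Every record has the same flat set of fields: no nesting, no overlap.
MAIL_ARCHIVE = [
    {"from": "alice@example.org", "subject": "Query languages",
     "body": "Notes on Boolean queries."},
    {"from": "bob@example.org", "subject": "Lunch",
     "body": "Any questions about query languages?"},
]

def search_field(records, field, pattern):
    """Return the indexes of records whose given field contains the pattern."""
    needle = pattern.lower()
    return [i for i, record in enumerate(records)
            if needle in record.get(field, "").lower()]

print(search_field(MAIL_ARCHIVE, "subject", "query"))  # [0]
print(search_field(MAIL_ARCHIVE, "body", "query"))     # [1]
```

The same pattern gives different answers depending on the field it is restricted to, which is the extra expressive power this model adds over plain keyword search.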

Page 18: Query Languages

Hypertext

Hypertext (navigational)
– a directed graph where the nodes hold some text and the links represent connections between nodes or between positions inside the nodes

Browsing / Searching in hypertext
– retrieval from a hypertext: browsing (traversing the hypertext nodes by following links → a navigational activity)
– even on the Web, one can search by the text contents of the nodes, but not by their structural connectivity
– some search engines now allow searching for specific source or destination anchors (but not general structure + content queries)

Page 19: Query Languages

Hierarchical Structure

Hierarchical structure
– an intermediate structuring model lying between fixed structure and hypertext
– represents a recursive decomposition of the text
– a natural model for many text collections, e.g., books, articles, legal documents, structured programs, etc.

Hierarchical models
– PAT Expressions, Overlapped Lists, List of References, Proximal Nodes, Tree Matching

Issues in hierarchical models
– static or dynamic structure, restrictions on the structure, integration with text, query language

Page 20: Query Languages

Query Protocols

Query protocols
– query languages used to query text databases
– standards intended not for human use but for querying library systems and CD-ROMs

Some important query protocols
– Z39.50
– WAIS
– CCL
– CD-RDx
– SFQL