brady-adkins
Query Languages
Information Retrieval
Concerned with the:
• Representation of,
• Storage of,
• Organization of, and
• Access to
information items.
Recap 1
Important points
– Data retrieval vs. information retrieval
– Users specify their needs via an intermediary language.
– Documents are represented by an abstraction of their content.
– Traditional model vs. berry-picking model
– Evaluation (precision/recall, single-value measures, human measures)
– Task characteristics (question answering, open-ended analysis, ad-hoc vs. filtering)
– Collection characteristics (size, document relations)
Recap 2
Important points
– Content types (text, descriptive/semantic metadata, multimedia)
– Metadata (formats & sets)
– Information theory (entropy)
– Models of symbol distribution (Zipf’s law, Heaps’ law)
– Distance measures (Hamming distance, Levenshtein distance)
– Markup languages (SGML, HTML, XML)
– Multimedia formats (header + data)
Query Languages
The query language determines which queries can be formulated
– User-oriented languages
– System-oriented languages (protocols)
The language depends on the underlying information retrieval model
Systems may enhance the query using
– Word expansion (thesaurus & stemming)
– Stopword removal
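A minimal sketch of query enhancement, assuming a tiny hand-made stopword list and thesaurus (both hypothetical) and a crude suffix stripper standing in for a real stemming algorithm:

```python
# Hypothetical resources; a real system would use much larger ones.
STOPWORDS = {"the", "of", "a", "in", "for"}
THESAURUS = {"car": ["automobile"]}

def crude_stem(word):
    """Strip a few common suffixes (an illustration, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def enhance(query):
    """Drop stopwords, stem each word, then add thesaurus synonyms."""
    terms = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue
        stem = crude_stem(word)
        for term in [stem] + THESAURUS.get(stem, []):
            if term not in terms:
                terms.append(term)
    return terms

print(enhance("the retrieval of cars"))
```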
Keyword-Based Querying
A query is composed of keywords
– Documents containing the keywords are returned
– Intuitive, easy to express, fast ranking
– Single-word and multi-word queries
Classification of keyword-based queries
– single-word queries
– context queries
– Boolean queries
– natural language queries
Single-Word Queries
Word query
– the most elementary query that can be formulated
– a word is a sequence of letters surrounded by separators
– in many models, words are the only type of query allowed
Result of word queries
– the set of documents containing at least one of the words of the query
– the resulting documents are ranked according to a degree of similarity to the query
• use term frequency and inverse document frequency
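The ranking of word-query results can be sketched as follows. This is one common tf-idf variant (raw term frequency times log(N/df)), chosen here as an assumption; the slides do not fix a particular formula, and the function names are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """A word is a maximal run of letters; everything else is a separator."""
    return re.findall(r"[a-z]+", text.lower())

def rank(query, docs):
    """Return (doc_index, score) pairs for documents containing at least
    one query word, best score first."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    scores = {}
    for word in set(tokenize(query)):
        df = sum(1 for tf in counts if word in tf)  # document frequency
        if df == 0:
            continue
        idf = math.log(n / df)                      # inverse document frequency
        for i, tf in enumerate(counts):
            if word in tf:
                scores[i] = scores.get(i, 0.0) + tf[word] * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = ["the cat sat", "the cat and the other cat", "a dog"]
print(rank("cat dog", docs))
```

Note that with this variant a word occurring in every document contributes a score of zero, so it does not affect the ordering.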
Context Queries
Phrase query
– a sequence of single-word queries
– ignores separators and uninteresting words
example: “…enhance the retrieval…”
– ranked in a fashion analogous to single words
Proximity query
– a phrase query with a maximum allowed distance (in characters or words) between the words in the query
example: distance = 4 “enhance the power of retrieval”
– physical proximity has semantic value: words in the same paragraph are related in some way
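A minimal sketch of both context queries over a single text, assuming distance is measured in word positions and that a small hypothetical stopword list defines the "uninteresting" words:

```python
import re

STOPWORDS = {"the", "of", "a", "an", "and"}  # hypothetical list

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def phrase_match(query, text):
    """True if the query's content words occur consecutively in the text
    once stopwords are ignored on both sides."""
    terms = [w for w in words(query) if w not in STOPWORDS]
    tokens = [w for w in words(text) if w not in STOPWORDS]
    if not terms:
        return False
    return any(tokens[i:i + len(terms)] == terms
               for i in range(len(tokens) - len(terms) + 1))

def proximity_match(query, text, max_distance):
    """True if the query's content words occur in order, each at most
    max_distance word positions after the previous one."""
    terms = [w for w in words(query) if w not in STOPWORDS]
    tokens = words(text)
    if not terms:
        return False

    def search(term_idx, prev_pos):
        if term_idx == len(terms):
            return True
        if prev_pos is None:
            candidates = range(len(tokens))
        else:
            stop = min(len(tokens), prev_pos + max_distance + 1)
            candidates = range(prev_pos + 1, stop)
        return any(tokens[p] == terms[term_idx] and search(term_idx + 1, p)
                   for p in candidates)

    return search(0, None)

print(proximity_match("enhance retrieval", "enhance the power of retrieval", 4))
```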
Boolean Queries
Boolean query
– composed of atoms (basic queries) that retrieve documents, and of Boolean operators which work on their operands (sets of documents)
Query syntax tree
– compositional scheme
– leaves: basic queries
– internal nodes: operators
[Figure: query syntax tree with the operators AND and OR over the words translation, syntax, and syntactic]
Boolean Queries (cont.)
Operators in Boolean queries
– e1 OR e2 : selects all docs satisfying e1 or e2
– e1 AND e2 : selects all docs satisfying both e1 and e2
– e1 BUT e2 : selects all docs satisfying e1 but not e2
Classic Boolean system
– no ranking of the retrieved docs
– does not allow partial matching
– alternative: fuzzy Boolean set of operators
• the meaning of AND and OR can be relaxed (e.g., appearing in some operands)
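The three classic operators map directly onto set operations. A minimal sketch, assuming queries are written as nested tuples (operator, left, right) with plain words at the leaves, and that a hypothetical inverted index maps each word to a set of document ids:

```python
def evaluate(node, index):
    """Evaluate a Boolean query tree against an inverted index."""
    if isinstance(node, str):              # leaf: a basic (word) query
        return set(index.get(node, ()))
    op, left, right = node                 # internal node: an operator
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "OR":
        return a | b
    if op == "AND":
        return a & b
    if op == "BUT":
        return a - b
    raise ValueError(f"unknown operator: {op}")

# Hypothetical index: word -> ids of the documents containing it.
index = {"translation": {1, 2, 3}, "syntax": {2, 4}, "syntactic": {3, 5}}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(evaluate(query, index))
```

The result is an unranked set, which matches the classic Boolean behavior described above: a document either satisfies the query or it does not.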
Natural Language
Natural language query
– blurs the distinction between AND and OR → the query becomes an enumeration of words and context queries
– a higher ranking is assigned to documents that match more parts of the query
Characteristics
– retrieves all the documents close to the query
– a complete document can be used as a query → leads to the use of relevance feedback techniques (the user selects a document from the result and submits it as a new query)
– example system: Ask Jeeves
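A rough sketch of this behavior, assuming the simplest possible score (the number of distinct query words a document contains); real systems weight the matches, and the function names here are illustrative:

```python
import re

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def rank_by_matches(query, docs):
    """Indices of documents matching at least one query word, ordered by
    how many distinct query words they contain."""
    q = set(words(query))
    scored = [(len(q & set(words(d))), i) for i, d in enumerate(docs)]
    ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ordered if score > 0]

docs = [
    "boolean queries use operators",
    "proximity queries and phrase queries",
    "image compression formats",
]
print(rank_by_matches("phrase and proximity queries", docs))
# A complete document can serve as the next query (relevance feedback):
print(rank_by_matches(docs[0], docs))
```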
Query Languages: Patterns & Structures
Pattern Matching
Pattern
– a set of syntactic features that must occur in a text segment
Types of patterns
– Words: a string (sequence of characters) in the text
– Prefixes: a string forming the beginning of a text word (e.g., comput → computer, computation)
– Suffixes: a string forming the termination of a text word (e.g., ters → computers, painters)
– Substrings: a string appearing within a text word; word separators are allowed (e.g., tal → talk, metallic & any flow → many flowers)
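These four pattern types can be sketched with plain string operations. The vocabulary below is a hypothetical word list, and `match_words` is an illustrative name:

```python
def match_words(vocabulary, pattern, kind):
    """Return the vocabulary words matched by a pattern of the given kind."""
    tests = {
        "word": lambda w: w == pattern,
        "prefix": lambda w: w.startswith(pattern),
        "suffix": lambda w: w.endswith(pattern),
        "substring": lambda w: pattern in w,
    }
    return [w for w in vocabulary if tests[kind](w)]

vocabulary = ["computer", "computation", "computers", "painters", "talk", "metallic"]
print(match_words(vocabulary, "comput", "prefix"))
print(match_words(vocabulary, "ters", "suffix"))
print(match_words(vocabulary, "tal", "substring"))

# A substring pattern may span a word separator when matched against raw text.
print("any flow" in "many flowers")
```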
Pattern Matching (cont.)
More types of patterns
– Ranges: a pair of strings that matches any word lying between them in lexicographical order (e.g., held to hold → hoax, hissing)
– Allowing errors: retrieves words similar to a given word (e.g., flower → flo wer [edit distance = 1])
– Regular expressions: general patterns built up from simple strings and operators (e.g., pro (blem | tein) (s | ε) (0 | 1 | 2)* → problem02, proteins)
– Extended patterns: classes of characters, conditional expressions, wild characters, combinations
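The three examples above can be checked with a short sketch. The function names are illustrative, the edit distance is the standard Levenshtein dynamic program, and the regular expression is a direct transcription of the slide's pattern into Python's `re` syntax:

```python
import re

def in_range(word, low, high):
    """Range pattern: the word lies between low and high lexicographically."""
    return low <= word <= high

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# pro (blem | tein) (s | ε) (0 | 1 | 2)*
PATTERN = re.compile(r"pro(blem|tein)s?[012]*")

print(in_range("hoax", "held", "hold"), in_range("hissing", "held", "hold"))
print(levenshtein("flower", "flo wer"))
print(bool(PATTERN.fullmatch("problem02")), bool(PATTERN.fullmatch("proteins")))
```

The pattern is case-sensitive as written, so a capitalized word would match only if the text is lowercased first or the `re.IGNORECASE` flag is used.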
Structural Queries
Structural query
– mixes content and structure in queries
• content constraints (words, phrases, patterns)
• structural constraints (containment, proximity) and restrictions on structural elements (chapters, sections)
Types of text structure
– form-like fixed structure
– hypertext structure
– hierarchical structure
Structural Queries (cont.)
[Figure: types of structures: form, hypertext, hierarchical]
Fixed Structure
Traditional restrictions
– documents had a fixed set of fields
– each field had some text inside
– only rarely could the fields appear in any order or repeat
– fields were not allowed to nest or overlap
– retrieval: specifying a given basic pattern to be found only in a given field
Characteristics
– reasonable for retrieving from text collections with a fixed structure (e.g., a mail archive), but inadequate for representing hierarchical structure such as that of HTML docs
– can be extended to the relational DB model
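A minimal sketch of field-restricted retrieval over a mail archive, assuming each message is a flat record with the same fields; the messages and field names are invented for illustration:

```python
import re

# Hypothetical mail archive: every record has the same non-nested fields.
messages = [
    {"sender": "alice@example.org", "subject": "Query languages",
     "body": "Notes on Boolean retrieval."},
    {"sender": "bob@example.org", "subject": "Lunch",
     "body": "Shall we discuss query protocols over lunch?"},
]

def search_field(records, field, word):
    """Return the records whose given field contains the word."""
    word = word.lower()
    return [r for r in records
            if word in re.findall(r"[a-z]+", r.get(field, "").lower())]

# The same word gives different results depending on the field searched.
print([m["sender"] for m in search_field(messages, "subject", "query")])
print([m["sender"] for m in search_field(messages, "body", "query")])
```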
Hypertext
Hypertext (navigational)
– a directed graph where the nodes hold some text and the links represent connections between nodes or between positions inside the nodes
Browsing / searching in hypertext
– retrieval from a hypertext: browsing (traversing the hypertext nodes by following links → navigational activity)
– even on the Web, one can search by the text contents of the nodes, but not by their structural connectivity
– some search engines now allow searching for specific source or destination anchors (but not general structure + content queries)
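The difference between browsing and content search can be sketched on a toy hypertext. The node names, texts, and links are invented, and links here connect whole nodes only (not positions inside them):

```python
from collections import deque

# Hypothetical hypertext: node -> text, and node -> outgoing links.
nodes = {
    "home": "Welcome to the retrieval course",
    "queries": "Boolean and proximity queries",
    "models": "Vector and Boolean models",
}
links = {"home": ["queries", "models"], "queries": ["models"], "models": []}

def browse(start, links):
    """Browsing: visit every node reachable by following links from start."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for target in links.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

def search_text(nodes, word):
    """Searching: select nodes by their text content, ignoring the links."""
    return [name for name, text in nodes.items() if word.lower() in text.lower().split()]

print(browse("queries", links))
print(search_text(nodes, "Boolean"))
```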
Hierarchical Structure
Hierarchical structure
– an intermediate structuring model lying between fixed structure and hypertext
– represents a recursive decomposition of the text
– a natural model for many text collections, e.g., books, articles, legal documents, structured programs, etc.
Hierarchical models
– PAT Expressions, Overlapped Lists, List of References, Proximal Nodes, Tree Matching
Issues in hierarchical models
– static or dynamic structure, restrictions on the structure, integration with text, query language
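A generic sketch of a containment query over a recursive decomposition. It is not an implementation of any of the models named above; the `Node` type, the sample book, and the function names are assumptions made for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # e.g. "book", "chapter", "section"
    title: str
    text: str = ""
    children: list = field(default_factory=list)

def subtree_text(node):
    """All text contained in a node, including its descendants."""
    return " ".join([node.text] + [subtree_text(c) for c in node.children])

def find(node, kind, word):
    """Titles of the elements of the given kind that contain the word."""
    hits = []
    if node.kind == kind and word.lower() in subtree_text(node).lower().split():
        hits.append(node.title)
    for child in node.children:
        hits.extend(find(child, kind, word))
    return hits

book = Node("book", "IR", children=[
    Node("chapter", "Query Languages", children=[
        Node("section", "Boolean Queries", "atoms and operators"),
        Node("section", "Pattern Matching", "prefixes and suffixes"),
    ]),
    Node("chapter", "Indexing", children=[
        Node("section", "Inverted Files", "vocabulary and occurrences"),
    ]),
])

print(find(book, "chapter", "operators"))
print(find(book, "section", "suffixes"))
```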
Query Protocols
Query protocols
– query languages used to query text databases
– standards intended not for human use but for querying library systems and CD-ROMs
Some important query protocols
– Z39.50
– WAIS
– CCL
– CD-RDx
– SFQL