KEYWORDS SEARCH ON STRUCTURED DATABASE
Xiaoyu Chen, Min Li, Yihan Gao, Tianning Xu
Introduction
Structured data: the schema serves as a summary of the data; retrieval goes through a structured query language
What would big data bring to structured data retrieval?
Introduction
For high volumes of data, Hadoop + Pig Latin came to the rescue
However, is this enough? Recall how you write a selection. What do you need to know? Can you remember this?
Introduction
Big data -> big and complicated schemas: hard to remember and operate on! They may not even fit in main memory!
What should we do about it? How does information retrieval deal with this?
Introduction
Search based on keywords: no need for a schema, and efficiency is guaranteed by an index
All of this seems straightforward and easy
What are the challenges?
Introduction
Search for “Apple + company”: it matches “apple (fruit)”, “Apple Inc.”, and “Adam’s apple”. Which one is correct? How do we filter?
Challenge 1: Filtering and disambiguation
Introduction
Search for “Steve Jobs + Apple”
The data is normalized across several tables. What should be returned?
Table 1: ID, Name, Gender, Employer, Location
Table 2: ID, Company, Location, Type, Product
Table 3: ID, Street, City, State, Country
Challenge 2: Automatic join-back
Introduction
Search for “Jordan”: it matches “Jordan (brand)”, “Michael Jordan (player)”, “Michael Jordan (professor)”, etc. All of them should match, but which one is better? Ranking
Challenge 3: Ranking of the results
Literature Overview
Two kinds of approaches:
1. Interpretative approach: reuses the database query language and index; translates the keywords into queries (3 papers)
2. Un-interpretative approach (our focus): typically builds its own index and data structures; models the data as a graph and applies graph-based analysis (3 papers)
Literature Overview – Interpretative approach
DBXplorer, Sanjay Agrawal et al.
General idea: two steps
Publish step: pre-computation, indexing, etc.
Search step: index lookup, enumeration of join trees, SQL generation, etc.
Efficiency: symbol table (index) design; symbol table compaction
Literature Overview – Interpretative approach
Publish step:
1. A database is identified, along with the set of tables and columns within it to be published.
2. Auxiliary tables are created to support keyword searches, e.g., the index table.
But how do we build an efficient index?
Literature Overview – Interpretative approach
Index goal: for each keyword, find the row_id and column_id it belongs to.
If the column (attribute) already has an index, we only need a column-level index (reusing the database index)
[Figure: a table with columns ID, Name, Gender, Addr, Org and rows 1–3, contrasting a column index with a row index]
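A DBXplorer-style symbol table can be sketched as a map from each keyword to the (table, column) locations where it occurs. The tables and values below are illustrative, not from the paper.

```python
from collections import defaultdict

def build_symbol_table(tables):
    """tables: {table_name: list of row dicts}.
    Returns keyword -> set of (table, column) locations."""
    symbol_table = defaultdict(set)
    for tname, rows in tables.items():
        for row in rows:
            for col, value in row.items():
                for kw in str(value).lower().split():
                    symbol_table[kw].add((tname, col))
    return symbol_table

db = {
    "Person": [{"ID": 1, "Name": "Steve Jobs", "Org": "Apple"}],
    "Company": [{"ID": 7, "Name": "Apple", "Product": "iPhone"}],
}
st = build_symbol_table(db)
print(sorted(st["apple"]))  # the keyword occurs in two different tables
```

A column-level variant (as when the column already has a database index) would record only the column, not the row, trading lookup precision for a much smaller table.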
Literature Overview – Interpretative approach
Compress the index table using foreign key constraints, etc.
General algorithm: CP-Comp
[Figure: a Sells table (Name, Product, …) and a Person table (Name, Gender, …); Table 1 shows the compressed table, Table 2 the uncompressed table]
Literature Overview – Interpretative approach
Search step
Step 1: look up the index to find the columns/rows of the database that contain the query keywords.
Step 2: identify and enumerate all potential subsets of tables that, if joined, might contain rows having all keywords. These form the join trees.
Step 3: for each enumerated join tree, construct (and execute) a SQL statement that joins the tables in the tree and selects the rows containing all keywords. The final rows are ranked and presented to the user.
Literature Overview – Interpretative approach
Join Tree example:
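Step 3 (building SQL from a join tree) can be sketched roughly as follows; the join-tree encoding and the schema names are assumptions for illustration, not DBXplorer's actual representation.

```python
def join_tree_to_sql(tables, join_edges, keyword_predicates):
    """tables: table names, root first; join_edges: (t1.col, t2.col) pairs;
    keyword_predicates: per-keyword WHERE conditions."""
    sql = f"SELECT * FROM {tables[0]}"
    joined = {tables[0]}
    for a, b in join_edges:
        t = b.split(".")[0]
        if t in joined:            # join the side not yet in the query
            t = a.split(".")[0]
        sql += f" JOIN {t} ON {a} = {b}"
        joined.add(t)
    if keyword_predicates:
        sql += " WHERE " + " AND ".join(keyword_predicates)
    return sql

sql = join_tree_to_sql(
    ["Person", "Company"],
    [("Person.Org", "Company.Name")],
    ["Person.Name LIKE '%Jobs%'", "Company.Product LIKE '%iPhone%'"],
)
print(sql)
```

One such statement is generated and executed per enumerated join tree.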
Literature Overview – Interpretative approach
Keyword Search in Databases: The Power of RDBMS
Lu Qin et al. SIGMOD 09
Integrating IR and DB
DB techniques provide users with efficient ways to access structured data in RDBMSs
IR techniques allow users to use keywords to access unstructured data
E.g., structural keyword search finds how tuples that contain the keywords are interconnected in an RDB (the structure); three types:
Schema-based approach
Connected Tree Semantics: a query result is a minimal total joining network of tuples; adjacent tuples are joined by foreign key references, with #tuples <= Tmax
Connected Tree Semantics
1. Candidate Network (CN) generation: relational algebra expressions that create trees containing all keywords, up to a certain size
2. CN evaluation: evaluate the generated CNs using SQL
Schema-based approach
Distinct Root Semantics: a query result is a collection of tuples all reachable from a root; the root uniquely defines the tuples, with distance(any tuple, root) <= Dmax
Schema-based approach
Distinct Core Semantics: a query result is a multi-center subgraph (a community); the keyword tuples uniquely define a community, with distance(any keyword tuple, any center tuple) <= Dmax
Distinct Core/Root Semantics
1. Create a pair for each tuple containing a keyword and every other tuple, recording the shortest distance between them
2. Generate the subgraphs with distinct cores/roots using SQL
Literature Overview – Interpretative approach
Keyword search over relational databases: a metadata approach
Bergamaschi et al., SIGMOD 11
Problem Definition
A database D is a collection of relational tables. Each relational table
contains its name, attributes and value domains. All these elements
together form the vocabulary.
A keyword query q is an ordered list of keywords. Each keyword specifies an element of interest.
A configuration of a keyword query on a database is an injective mapping from the keywords to the vocabulary of the database
Task: first derive the top configurations based on some metrics, and then interpret them as SQL queries (select-project-join interpretations)
From Keywords to Queries
Need to consider the inter-dependency of the query keywords:
introduce two kinds of weights, the intrinsic weights and the contextual weights
Need to give a ranked list of all the configurations:
develop an algorithm based on, and extending, the Hungarian (a.k.a. Munkres) algorithm
Need to separate the evaluation of schema terms from that of value terms:
evaluate the value weights based on the schema mapping
Contributions and Insights
Formally define the problem of keyword querying over relational databases that lack a priori access to the database instance
Introduce the notion of a weight as a measure of the likelihood that the semantics of a keyword are represented by a database structure; both intrinsic and contextual weights need to be considered
Extend and exploit the Hungarian (a.k.a. Munkres) algorithm to generate a ranking of the different interpretations
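The extended Hungarian algorithm itself is beyond a slide, but the underlying assignment problem can be illustrated with a brute-force stand-in: enumerate every injective keyword-to-term mapping (configuration) and rank them by total weight. The keywords, vocabulary, and weight matrix below are invented for illustration.

```python
from itertools import permutations

keywords = ["jobs", "apple"]
vocab = ["Person.Name", "Company.Name", "Company.Product"]
weight = {  # likelihood that keyword k is represented by database term t
    ("jobs", "Person.Name"): 0.9, ("jobs", "Company.Name"): 0.1,
    ("jobs", "Company.Product"): 0.0,
    ("apple", "Person.Name"): 0.1, ("apple", "Company.Name"): 0.8,
    ("apple", "Company.Product"): 0.4,
}

configs = []  # every injective keyword -> vocabulary mapping (configuration)
for terms in permutations(vocab, len(keywords)):
    mapping = dict(zip(keywords, terms))
    score = sum(weight[k, t] for k, t in mapping.items())
    configs.append((score, mapping))
configs.sort(key=lambda c: -c[0])  # ranked list of configurations

best_score, best = configs[0]
print(best_score, best)
```

The Hungarian algorithm finds the same maximum-weight assignment in polynomial time instead of enumerating all permutations.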
Literature Overview
Two kinds of approaches:
1. Interpretative approach: reuses the database query language and index; translates the keywords into queries
2. Un-interpretative approach: typically builds its own index and data structures; models the data as a graph and applies graph-based analysis
Literature Overview – Un-interpretative approach
Effective Keyword Search in Relational Databases
Fang Liu et al. SIGMOD 06
Difficulties of keyword search: in text databases, keyword search only needs to compute a score for each document
Keyword search on an RDBMS is more complicated (relations, attributes, tuples):
1. Generate tuple trees (the answers) by joining tuples from different tables
2. Rank the answers by computing a score
Generate Answer Tuple Trees
Tuple tree answer rules:
1. Each leaf node in a tuple tree must contain at least one keyword
2. Each tuple appears at most once in a tree
Separate the tuples into tuple sets that contain keywords and tuple sets that contain all tuples of each relation; join adjacent sets along the schema graph, within the constraints of the answer trees
Ranking Tuple Trees
Treat the text of each tuple within an answer tree as a “document”
Assign a similarity score between each document and the query, normalizing for term frequency, document frequency, and document length
Compute the score of a tuple tree as the average over all its documents
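This scoring scheme can be sketched as follows; the TF-IDF normalization here is simplified (no document-length pivoting), so it only approximates the paper's formula, and the corpus is invented.

```python
import math

def score_tree(tuple_texts, query_terms, collection):
    """Average a simplified TF-IDF score over the 'documents' (tuple texts)
    in one answer tree. collection is every tuple text in the database."""
    n = len(collection)
    def idf(t):
        df = sum(1 for doc in collection if t in doc.split())
        return math.log((n + 1) / (df + 1)) + 1
    def doc_score(doc):
        words = doc.split()
        # term frequency, normalized by document length, weighted by IDF
        return sum(words.count(t) / len(words) * idf(t) for t in query_terms)
    return sum(doc_score(d) for d in tuple_texts) / len(tuple_texts)

corpus = ["michael jordan player", "jordan brand shoes", "michael jordan professor"]
s = score_tree(["michael jordan player"], ["jordan"], corpus)
print(round(s, 3))
```

Averaging over the tree's documents (rather than summing) keeps larger trees from winning merely by containing more text.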
Focused work
Keyword Searching and Browsing in Databases using BANKS
Gaurav Bhalotia et al. ICDE 02
BANKS (Browsing And Keyword Searching)
a system which enables keyword-based search on relational databases, together with data and schema browsing
Architecture: User <-> HTTP <-> BANKS System <-> JDBC <-> Database
Database and Query Model
Relational Database -> Directed Graph
Each Tuple in Database -> Node in Graph
Foreign Key -> Directed Edge
Database and Query Model
An answer to a query should be a subgraph connecting the nodes matching the keywords.
The importance of a link depends on the type of the link, i.e., which relations it connects, and on its semantics.
Ignoring directionality would cause problems because of “hubs” that are connected to a large number of nodes.
Database and Query Model
We may restrict the information nodes to come from a selected set of nodes of the graph
We incorporate another interesting feature, node weights, inspired by prestige rankings
Node weights and tree weights need to be combined into an overall relevance score
Formal Model
Node weight N(u): depends on the node’s prestige
Set the node prestige to the in-degree of the node: nodes with multiple pointers to them get higher prestige
Formal Model
Edge weights: some popular tuples are connected to many other tuples
Each edge carries a forward and a backward edge weight
Weight of a forward link = the strength of the proximity relationship between the two tuples (set to 1 by default)
Weight of a backward link = the in-degree of the node it points to
Formal Model
Graph score: combine the node score and the edge score
Node score = average of the node weights
Edge score = 1 / (1 + sum of the edge weights)
Multiplicative combination: Score = EScore · NScore^λ (an additive combination, (1 − λ)·EScore + λ·NScore, is also possible)
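A minimal sketch of this scoring, assuming the node score is the average node weight and λ trades off the two components; the example weights are invented.

```python
def tree_score(node_weights, edge_weights, lam=0.2, multiplicative=True):
    """Combine the edge score and the (average) node score of an answer
    tree; lam controls the influence of node prestige."""
    e_score = 1.0 / (1.0 + sum(edge_weights))
    n_score = sum(node_weights) / len(node_weights)
    if multiplicative:
        return e_score * n_score ** lam
    return (1 - lam) * e_score + lam * n_score

s = tree_score([4, 2, 6], [1.0, 1.0], lam=0.2)
print(round(s, 4))
```

Note that the edge score shrinks as the tree grows, so tighter connections score higher regardless of which combination is used.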
Result
Result of query “sudarshan soumen”
Searching for the best answer: Backward Expanding Search
Algorithm intuition: find vertices from which a forward path exists to at least one node from each keyword set Si
Run a concurrent single-source shortest path algorithm from each node matching a keyword
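The concurrent shortest-path runs can be sketched with a single shared priority queue; the graph, edge weights, and keyword hits below are invented for illustration.

```python
import heapq

def backward_search(rev_adj, keyword_nodes):
    """rev_adj: node -> list of (predecessor, weight), i.e. edges followed
    backwards. keyword_nodes: one set of matching nodes per keyword.
    Returns candidate answer roots in the order they are found."""
    k = len(keyword_nodes)
    dist = [dict() for _ in range(k)]       # per-keyword shortest distances
    heap = []
    for i, origin in enumerate(keyword_nodes):
        for n in origin:
            dist[i][n] = 0
            heapq.heappush(heap, (0, i, n))
    roots = []
    while heap:
        d, i, u = heapq.heappop(heap)
        if d > dist[i].get(u, float("inf")):
            continue                        # stale heap entry
        if u not in roots and all(u in dist[j] for j in range(k)):
            roots.append(u)                 # u reaches every keyword group
        for v, w in rev_adj.get(u, []):
            if d + w < dist[i].get(v, float("inf")):
                dist[i][v] = d + w
                heapq.heappush(heap, (d + w, i, v))
    return roots

# Toy graph: r -> a -> k1 and r -> b -> k2, stored backwards.
rev = {"k1": [("a", 1)], "k2": [("b", 1)], "a": [("r", 1)], "b": [("r", 1)]}
roots = backward_search(rev, [{"k1"}, {"k2"}])
print(roots)  # only r has forward paths to both keyword nodes
```

Because all runs share one queue, nodes are discovered in globally increasing distance order, so early roots tend to be the tightest answers.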
Searching for the best answer
[Figure: an answer tree connecting author nodes S. Sudarshan, Prasan Roy, and Charuta through writes tuples to the paper node “BANKS: Keyword search…”]
As an extension of BANKS
BLINKS: ranked keyword searches on graphs.
He H. et al., SIGMOD 07
Introduction
Efficient ranked keyword searches on schemaless node-labeled graphs
Challenges: no schema to exploit for optimization; hard to guarantee strong performance
Proposed technique: a backward search algorithm; SLINKS, a single-level index search; and, as an extension for scalability, BLINKS (a bi-level index search)
Contributions: cost-balanced-expansion-based backward search; combining indexing with search; partition-based (bi-level) indexing
Problem Formulation
Given a search query q = (w1, …, wk) and a directed graph G, an answer to q is a pair <r, (n1, …, nk)>, where r and the ni are nodes in G, satisfying:
Coverage: for every i, node ni contains keyword wi
Connectivity: for every i, there exists a directed path in G from r to ni
How do we measure which result is better?
With a score function measuring both the graph structure and the content; its major part is the combined shortest distance from r to the ni
Backward search algorithm
1. For every keyword wi, maintain a cluster Ci denoting the set of nodes that can reach keyword wi.
2. Initially, Ci starts out as the set of nodes that directly contain wi; we call this initial set the cluster origin and its member nodes keyword nodes.
3. In each search step, choose an incoming edge to one of the clusters and expand it.
Example:
c’s cluster: initially {3, 12}
After choosing edge 5 -> 12: {3, 5, 12}. All incoming edges of node 5 are now visible to the algorithm.
How to choose which edge to include? Expand a cluster via the node with the shortest distance to its origin.
How to choose which cluster to expand? Expand the cluster with the smallest cardinality.
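The two expansion choices can be sketched as a selection function over the current clusters; the cluster contents below are invented.

```python
def pick_expansion(clusters):
    """clusters: one dict per keyword, mapping frontier node -> current
    shortest distance to the cluster origin. Returns (cluster, node)."""
    # Across clusters: prefer the cluster with the smallest cardinality.
    i = min(range(len(clusters)), key=lambda j: len(clusters[j]))
    # Within the cluster: expand the node closest to the keyword origin.
    node = min(clusters[i], key=clusters[i].get)
    return i, node

clusters = [
    {"n3": 0, "n12": 0, "n5": 1},  # cluster for keyword c
    {"n7": 0},                     # cluster for keyword d
]
print(pick_expansion(clusters))  # the singleton cluster is expanded first
```

Balancing expansion across clusters this way avoids wasting work on one keyword whose cluster has already grown large.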
A single level index
Use precomputation to enhance online performance. To find the shortest distance to a keyword w, compute the keyword-node lists.
Keyword-node list: an ordered list of the nodes that can reach keyword w.
Each entry is (dist, node, first, knode): dist is the distance between node and a node containing w; first is the first node on that shortest path; knode is a node containing w at the shortest distance.
A single level index
Use precomputation to enhance online performance. To augment backward search with forward expansion (looking forward from the current state to see whether we have already found an answer):
Node-keyword map: returns the shortest distance from node u to keyword w (infinity if w cannot be reached).
SLINKS algorithm
Expanding backwards: use a cursor to traverse each keyword-node list, which gives the next node to expand in each cluster. Across clusters, pick the cursor to expand next in a round-robin manner, based on cost-balanced expansion among the clusters.
Expanding forward: as soon as we visit a node, look up its distances to the other keywords using the node-keyword map; we can immediately determine whether we have found the root of an answer (none of the distances is infinite).
Stopping: maintain a threshold t, the current k-th shortest combined distance among all known answer roots. The combined distance of any unvisited node is lower-bounded by the distances at the current cursor positions, so the search can stop once this bound exceeds t.
BLINKS ( brief idea)
Is the index too large to store, and too expensive to construct, on large graphs?
Use a divide-and-conquer approach to create a bi-level index: partition the data graph into multiple subgraphs, or blocks
Intra-block index: indexes the information inside a block; 4 kinds of indexes, 2 of them for separator nodes (which are important and therefore treated specially)
Block index: 2 simple indexes
Conclusion
Keyword search challenges: filtering and disambiguation; automatic join-back; ranking of the results
Additional considerations: efficiency and space
Thank you and have fun