KEYWORDS SEARCH ON STRUCTURED DATABASE
Xiaoyu Chen, Min Li, Yihan Gao, Tianning Xu
Introduction
Structured data: the schema serves as a summary of the data; retrieval goes through a structured query language
What would big data bring to structured data retrieval?
Introduction
For high volumes of data, Hadoop + Pig Latin came to the rescue
However, is this enough? Recall how you write a selection. What do you need to know? Can you remember this?
Introduction
Big data -> big and complicated schemas: hard to remember and operate on! They may not even fit in main memory!
What should we do about it? How does information retrieval deal with this?
Introduction
Search based on keywords: no need for a schema, and efficiency is guaranteed by an index
All of this seems straightforward and easy
What are the challenges?
Introduction
Search for “Apple + company”: it matches “apple (fruit)”, “Apple Inc.”, and “Adam’s apple”. Which one is correct? How do we filter?
Challenge 1: Filtering and disambiguation
Introduction
Search for “Steve Jobs + Apple”
The data is normalized across several tables. What should be returned?
Table 1: ID, Name, Gender, Employer, Location
Table 2: ID, Company, Location, Type, Product
Table 3: ID, Street, City, State, Country
Challenge 2: Automatic join-back
Introduction
Search for “Jordan”: it matches “Jordan (brand)”, “Michael Jordan (player)”, “Michael Jordan (professor)”, etc. All of them should match, but which one is better? Ranking
Challenge 3: Ranking of the results
Literature Overview
Two kinds of approaches:
1. Interpretative approach: reuses the database query language and index; translates the keywords into queries (3 papers)
2. Un-interpretative approach (our focus): typically builds its own index and data structures; models the data as a graph and applies graph-based analysis (3 papers)
Literature Overview – Interpretative approach
DBXplorer, Sanjay Agrawal et al.
General idea: two steps
Publish step: pre-computation, indexing, etc.
Search step: index lookup, enumeration of join trees, SQL generation, etc.
Efficiency: symbol table (index) design; symbol table compaction
Literature Overview – Interpretative approach
Publish step:
1. A database is identified, along with the set of tables and columns within it to be published.
2. Auxiliary tables are created to support keyword searches, e.g., the index table.
But how do we build an efficient index?
Literature Overview – Interpretative approach
Index goal: for each keyword, find the row_id and column_id it belongs to.
If the column (attribute) already has an index, we only need a column-level index (reusing the database index)
[Figure: a table with columns ID, Name, Gender, Addr, Org and rows 1–3, contrasting a column index with a row index]
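A DBXplorer-style symbol table can be sketched as a map from each keyword to the (table, column) locations where it occurs. The tables and values below are illustrative, not from the paper.

```python
from collections import defaultdict

def build_symbol_table(tables):
    """tables: {table_name: list of row dicts}.
    Returns keyword -> set of (table, column) locations."""
    symbol_table = defaultdict(set)
    for tname, rows in tables.items():
        for row in rows:
            for col, value in row.items():
                for kw in str(value).lower().split():
                    symbol_table[kw].add((tname, col))
    return symbol_table

db = {
    "Person": [{"ID": 1, "Name": "Steve Jobs", "Org": "Apple"}],
    "Company": [{"ID": 7, "Name": "Apple", "Product": "iPhone"}],
}
st = build_symbol_table(db)
print(sorted(st["apple"]))  # the keyword occurs in two different tables
```

A column-level variant (as when the column already has a database index) would record only the column, not the row, trading lookup precision for a much smaller table.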
Literature Overview – Interpretative approach
Compress the index table using foreign key constraints, etc.
General algorithm: CP-Comp
[Figure: a Sells table (Name, Product, …) and a Person table (Name, Gender, …); Table 1 shows the compressed table, Table 2 the uncompressed table]
Literature Overview – Interpretative approach
Search step
Step 1: look up the index to find the columns/rows of the database that contain the query keywords.
Step 2: identify and enumerate all potential subsets of tables that, if joined, might contain rows having all keywords. These form the join trees.
Step 3: for each enumerated join tree, construct (and execute) a SQL statement that joins the tables in the tree and selects the rows containing all keywords. The final rows are ranked and presented to the user.
Literature Overview – Interpretative approach
Join Tree example:
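Step 3 (building SQL from a join tree) can be sketched roughly as follows; the join-tree encoding and the schema names are assumptions for illustration, not DBXplorer's actual representation.

```python
def join_tree_to_sql(tables, join_edges, keyword_predicates):
    """tables: table names, root first; join_edges: (t1.col, t2.col) pairs;
    keyword_predicates: per-keyword WHERE conditions."""
    sql = f"SELECT * FROM {tables[0]}"
    joined = {tables[0]}
    for a, b in join_edges:
        t = b.split(".")[0]
        if t in joined:            # join the side not yet in the query
            t = a.split(".")[0]
        sql += f" JOIN {t} ON {a} = {b}"
        joined.add(t)
    if keyword_predicates:
        sql += " WHERE " + " AND ".join(keyword_predicates)
    return sql

sql = join_tree_to_sql(
    ["Person", "Company"],
    [("Person.Org", "Company.Name")],
    ["Person.Name LIKE '%Jobs%'", "Company.Product LIKE '%iPhone%'"],
)
print(sql)
```

One such statement is generated and executed per enumerated join tree.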
Literature Overview – Interpretative approach
Keyword Search in Databases: The Power of RDBMS
Lu Qin et al. SIGMOD 09
Integrating IR and DB
DB techniques provide users with efficient ways to access structured data in RDBMSs
IR techniques allow users to use keywords to access unstructured data
E.g., structural keyword search finds how tuples that contain the keywords are interconnected in an RDB (the structure); three types:
Schema-based approach
Connected Tree Semantics: a query result is a minimal total joining network of tuples; adjacent tuples are joined by foreign key references, with #tuples <= Tmax
Connected Tree Semantics
1. Candidate Network (CN) generation: relational algebra expressions that create trees containing all keywords, up to a certain size
2. CN evaluation: evaluate the generated CNs using SQL
Schema-based approach
Distinct Root Semantics: a query result is a collection of tuples all reachable from a root; the root uniquely defines the tuples, with distance(any tuple, root) <= Dmax
Schema-based approach
Distinct Core Semantics: a query result is a multi-center subgraph (a community); the keyword tuples uniquely define a community, with distance(any keyword tuple, any center tuple) <= Dmax
Distinct Core/Root Semantics
1. Create a pair for each tuple containing a keyword and every other tuple, recording the shortest distance between them
2. Generate the subgraphs with distinct cores/roots using SQL
Literature Overview – Interpretative approach
Keyword search over relational databases: a metadata approach
Bergamaschi et al., SIGMOD 11
Problem Definition
A database D is a collection of relational tables. Each relational table
contains its name, attributes and value domains. All these elements
together form the vocabulary.
A keyword query q is an ordered list of keywords. Each keyword specifies an element of interest.
A configuration of a keyword query on a database is an injective mapping from the keywords to the vocabulary of the database
Task: first derive the top configurations based on some metrics, and then interpret them as SQL queries (select-project-join interpretations)
From Keywords to Queries
Need to consider the inter-dependency of the query keywords:
introduce two kinds of weights, the intrinsic weights and the contextual weights
Need to give a ranked list of all the configurations:
develop an algorithm based on, and extending, the Hungarian (a.k.a. Munkres) algorithm
Need to separate the evaluation of schema terms from that of value terms:
evaluate the value weights based on the schema mapping
Contributions and Insights
Formally define the problem of keyword querying over relational databases that lack a priori access to the database instance
Introduce the notion of a weight as a measure of the likelihood that the semantics of a keyword are represented by a database structure; both intrinsic and contextual weights need to be considered
Extend and exploit the Hungarian (a.k.a. Munkres) algorithm to generate a ranking of the different interpretations
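The extended Hungarian algorithm itself is beyond a slide, but the underlying assignment problem can be illustrated with a brute-force stand-in: enumerate every injective keyword-to-term mapping (configuration) and rank them by total weight. The keywords, vocabulary, and weight matrix below are invented for illustration.

```python
from itertools import permutations

keywords = ["jobs", "apple"]
vocab = ["Person.Name", "Company.Name", "Company.Product"]
weight = {  # likelihood that keyword k is represented by database term t
    ("jobs", "Person.Name"): 0.9, ("jobs", "Company.Name"): 0.1,
    ("jobs", "Company.Product"): 0.0,
    ("apple", "Person.Name"): 0.1, ("apple", "Company.Name"): 0.8,
    ("apple", "Company.Product"): 0.4,
}

configs = []  # every injective keyword -> vocabulary mapping (configuration)
for terms in permutations(vocab, len(keywords)):
    mapping = dict(zip(keywords, terms))
    score = sum(weight[k, t] for k, t in mapping.items())
    configs.append((score, mapping))
configs.sort(key=lambda c: -c[0])  # ranked list of configurations

best_score, best = configs[0]
print(best_score, best)
```

The Hungarian algorithm finds the same maximum-weight assignment in polynomial time instead of enumerating all permutations.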
Literature Overview
Two kinds of approaches:
1. Interpretative approach: reuses the database query language and index; translates the keywords into queries
2. Un-interpretative approach: typically builds its own index and data structures; models the data as a graph and applies graph-based analysis
Literature Overview – Un-interpretative approach
Effective Keyword Search in Relational Databases
Fang Liu et al. SIGMOD 06
Difficulties of keyword search: in text databases, keyword search only needs to compute a score for each document
Keyword search on an RDBMS is more complicated (relations, attributes, tuples):
1. Generate tuple trees (the answers) by joining tuples from different tables
2. Rank the answers by computing a score
Generate Answer Tuple Trees
Tuple tree answer rules:
1. Each leaf node in a tuple tree must contain at least one keyword
2. Each tuple appears at most once in a tree
Separate the tuples into tuple sets that contain keywords and tuple sets that contain all tuples of each relation; join adjacent sets along the schema graph, within the constraints of the answer trees
Ranking Tuple Trees
Treat the text of each tuple within an answer tree as a “document”
Assign a similarity score between each document and the query, normalizing for term frequency, document frequency, and document length
Compute the score of a tuple tree as the average over all its documents
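This scoring scheme can be sketched as follows; the TF-IDF normalization here is simplified (no document-length pivoting), so it only approximates the paper's formula, and the corpus is invented.

```python
import math

def score_tree(tuple_texts, query_terms, collection):
    """Average a simplified TF-IDF score over the 'documents' (tuple texts)
    in one answer tree. collection is every tuple text in the database."""
    n = len(collection)
    def idf(t):
        df = sum(1 for doc in collection if t in doc.split())
        return math.log((n + 1) / (df + 1)) + 1
    def doc_score(doc):
        words = doc.split()
        # term frequency, normalized by document length, weighted by IDF
        return sum(words.count(t) / len(words) * idf(t) for t in query_terms)
    return sum(doc_score(d) for d in tuple_texts) / len(tuple_texts)

corpus = ["michael jordan player", "jordan brand shoes", "michael jordan professor"]
s = score_tree(["michael jordan player"], ["jordan"], corpus)
print(round(s, 3))
```

Averaging over the tree's documents (rather than summing) keeps larger trees from winning merely by containing more text.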
Focused work
Keyword Searching and Browsing in Databases using BANKS
Gaurav Bhalotia et al. ICDE 02
BANKS (Browsing And Keyword Searching)
a system which enables keyword-based search on relational databases, together with data and schema browsing
Architecture: User <-> HTTP <-> BANKS System <-> JDBC <-> Database
Database and Query Model
Relational Database -> Directed Graph
Each Tuple in Database -> Node in Graph
Foreign Key -> Directed Edge
Database and Query Model
An answer to a query should be a subgraph connecting the nodes matching the keywords.
The importance of a link depends on the type of the link, i.e., which relations it connects, and on its semantics.
Ignoring directionality would cause problems because of “hubs” that are connected to a large number of nodes.
Database and Query Model
We may restrict the information nodes to come from a selected set of nodes of the graph
We incorporate another interesting feature, node weights, inspired by prestige rankings
Node weights and tree weights need to be combined into an overall relevance score
Formal Model
Node weight N(u): depends on the node’s prestige
Set the node prestige to the in-degree of the node: nodes with multiple pointers to them get higher prestige
Formal Model
Edge weights: some popular tuples are connected to many other tuples
Each edge carries a forward and a backward edge weight
Weight of a forward link = the strength of the proximity relationship between the two tuples (set to 1 by default)
Weight of a backward link = the in-degree of the node it points to
Formal Model
Graph score: combine the node score and the edge score
Node score = average of the node weights
Edge score = 1 / (1 + sum of the edge weights)
Multiplicative combination: Score = EScore · NScore^λ (an additive combination, (1 − λ)·EScore + λ·NScore, is also possible)
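A minimal sketch of this scoring, assuming the node score is the average node weight and λ trades off the two components; the example weights are invented.

```python
def tree_score(node_weights, edge_weights, lam=0.2, multiplicative=True):
    """Combine the edge score and the (average) node score of an answer
    tree; lam controls the influence of node prestige."""
    e_score = 1.0 / (1.0 + sum(edge_weights))
    n_score = sum(node_weights) / len(node_weights)
    if multiplicative:
        return e_score * n_score ** lam
    return (1 - lam) * e_score + lam * n_score

s = tree_score([4, 2, 6], [1.0, 1.0], lam=0.2)
print(round(s, 4))
```

Note that the edge score shrinks as the tree grows, so tighter connections score higher regardless of which combination is used.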
Result
Result of query “sudarshan soumen”
Searching for the best answer: Backward Expanding Search
Algorithm intuition: find vertices from which a forward path exists to at least one node from each keyword set Si
Run a concurrent single-source shortest path algorithm from each node matching a keyword
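The concurrent shortest-path runs can be sketched with a single shared priority queue; the graph, edge weights, and keyword hits below are invented for illustration.

```python
import heapq

def backward_search(rev_adj, keyword_nodes):
    """rev_adj: node -> list of (predecessor, weight), i.e. edges followed
    backwards. keyword_nodes: one set of matching nodes per keyword.
    Returns candidate answer roots in the order they are found."""
    k = len(keyword_nodes)
    dist = [dict() for _ in range(k)]       # per-keyword shortest distances
    heap = []
    for i, origin in enumerate(keyword_nodes):
        for n in origin:
            dist[i][n] = 0
            heapq.heappush(heap, (0, i, n))
    roots = []
    while heap:
        d, i, u = heapq.heappop(heap)
        if d > dist[i].get(u, float("inf")):
            continue                        # stale heap entry
        if u not in roots and all(u in dist[j] for j in range(k)):
            roots.append(u)                 # u reaches every keyword group
        for v, w in rev_adj.get(u, []):
            if d + w < dist[i].get(v, float("inf")):
                dist[i][v] = d + w
                heapq.heappush(heap, (d + w, i, v))
    return roots

# Toy graph: r -> a -> k1 and r -> b -> k2, stored backwards.
rev = {"k1": [("a", 1)], "k2": [("b", 1)], "a": [("r", 1)], "b": [("r", 1)]}
roots = backward_search(rev, [{"k1"}, {"k2"}])
print(roots)  # only r has forward paths to both keyword nodes
```

Because all runs share one queue, nodes are discovered in globally increasing distance order, so early roots tend to be the tightest answers.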
Searching for the best answer
[Figure: an answer tree connecting author nodes S. Sudarshan, Prasan Roy, and Charuta through writes tuples to the paper node “BANKS: Keyword search…”]
As an extension of BANKS
BLINKS: ranked keyword searches on graphs.
He H. et al., SIGMOD 07
Introduction
Efficient ranked keyword searches on schemaless node-labeled graphs
Challenges: no schema to exploit for optimization; hard to guarantee strong performance
Proposed technique: a backward search algorithm; SLINKS, a single-level index search; and, as an extension for scalability, BLINKS (a bi-level index search)
Contributions: cost-balanced-expansion-based backward search; combining indexing with search; partition-based (bi-level) indexing
Problem Formulation
Given a search query q = (w1, …, wk) and a directed graph G, an answer to q is a pair <r, (n1, …, nk)>, where r and the ni are nodes in G, satisfying:
Coverage: for every i, node ni contains keyword wi
Connectivity: for every i, there exists a directed path in G from r to ni
How do we measure which result is better?
With a score function measuring both the graph structure and the content; its major part is the combined shortest distance from r to the ni
Backward search algorithm
1. For every keyword wi, maintain a cluster Ci denoting the set of nodes that can reach keyword wi.
2. Initially, Ci starts out as the set of nodes that directly contain wi; we call this initial set the cluster origin and its member nodes keyword nodes.
3. In each search step, choose an incoming edge to one of the clusters and expand it.
Example:
c’s cluster: initially {3, 12}
After choosing edge 5 -> 12: {3, 5, 12}. All incoming edges of node 5 are now visible to the algorithm.
How to choose which edge to include? Expand a cluster via the node with the shortest distance to its origin.
How to choose which cluster to expand? Expand the cluster with the smallest cardinality.
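The two expansion choices can be sketched as a selection function over the current clusters; the cluster contents below are invented.

```python
def pick_expansion(clusters):
    """clusters: one dict per keyword, mapping frontier node -> current
    shortest distance to the cluster origin. Returns (cluster, node)."""
    # Across clusters: prefer the cluster with the smallest cardinality.
    i = min(range(len(clusters)), key=lambda j: len(clusters[j]))
    # Within the cluster: expand the node closest to the keyword origin.
    node = min(clusters[i], key=clusters[i].get)
    return i, node

clusters = [
    {"n3": 0, "n12": 0, "n5": 1},  # cluster for keyword c
    {"n7": 0},                     # cluster for keyword d
]
print(pick_expansion(clusters))  # the singleton cluster is expanded first
```

Balancing expansion across clusters this way avoids wasting work on one keyword whose cluster has already grown large.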
A single level index
Use precomputation to enhance online performance. To find the shortest distance to a keyword w, compute the keyword-node lists.
Keyword-node list: an ordered list of the nodes that can reach keyword w.
Each entry is (dist, node, first, knode): dist is the distance between node and a node containing w; first is the first node on that shortest path; knode is a node containing w at the shortest distance.
A single level index
Use precomputation to enhance online performance. To augment backward search with forward expansion (looking forward from the current state to see whether we have already found an answer):
Node-keyword map: returns the shortest distance from node u to keyword w (infinity if w cannot be reached).
SLINKS algorithm
Expanding backwards: use a cursor to traverse each keyword-node list, which gives the next node to expand in each cluster. Across clusters, pick the cursor to expand next in a round-robin manner, based on cost-balanced expansion among the clusters.
Expanding forward: as soon as we visit a node, look up its distances to the other keywords using the node-keyword map; we can immediately determine whether we have found the root of an answer (none of the distances is infinite).
Stopping: maintain a threshold t, the current k-th shortest combined distance among all known answer roots. The combined distance of any unvisited node is lower-bounded by the distances at the current cursor positions, so the search can stop once this bound exceeds t.
BLINKS ( brief idea)
Is the index too large to store, and too expensive to construct, on large graphs?
Use a divide-and-conquer approach to create a bi-level index: partition the data graph into multiple subgraphs, or blocks
Intra-block index: indexes the information inside a block; 4 kinds of indexes, 2 of them for separator nodes (which are important and therefore treated specially)
Block index: 2 simple indexes
Conclusion
Keyword search challenges: filtering and disambiguation; automatic join-back; ranking of the results
Additional considerations: efficiency and space
Thank you and have fun