Effective Keyword Based Selection of Relational Databases
Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung
Overview
• What is unstructured retrieval?
This is retrieving data from documents like journals, articles etc.
• What is structured retrieval?
Retrieving data from databases, XML files etc. (that is, structural relationship between data exists)
Traditional IR approach
• Use keyword frequency and document frequency statistics for query words to determine relevance of a document
– Keyword frequency – No. of times a keyword appears in a document
– Document frequency – No. of documents in which a keyword appears.
• Use the combination of the two as a weighting factor
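As a rough illustration of combining the two statistics, the sketch below uses the common TF-IDF form, tf × log(N / df). The slides do not say which combination is meant, so the formula and the names `tf_idf` and `docs` are assumptions made for this example.

```python
import math

def tf_idf(term, doc, docs):
    """Weight of `term` in `doc` (a list of tokens) within the collection `docs`."""
    tf = doc.count(term)                      # keyword frequency in this document
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# Hypothetical three-document collection.
docs = [
    ["multimedia", "database", "database"],
    ["database", "systems"],
    ["vldb", "conference"],
]
weights = {t: tf_idf(t, docs[0], docs) for t in ("multimedia", "database")}
```

A term that is frequent in one document but rare across the collection gets the highest weight; nothing in this score looks at how tuples are linked.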
Traditional IR technique is inadequate for relational databases
• Traditional IR techniques do not capture the relationship between data sources in a normalized database
• Need to take into account the relationship between keywords in a database
• Example:
– A keyword is in a tuple referenced by many other tuples
– No. of joins that need to be performed to get all keywords in a query
Example
DB1
Inproceedings
id   inprocID        title                             procID  year  mon  annote
t1   Adiba1986       Historical Multimedia Databases   23      1988  Aug  temporal
t2   Abarbanel1987   Connection Perspective Reform     18      1987  May  Intellicorp
Conferences
id   procID   Conference
t3   23       The conference on Very Large Databases (VLDB)
t4   18       ACM Sigmod Conf on management of data
Example
DB2
Example
Query = (Multimedia, Database, VLDB)
• DB1 will give us good results,
• But traditional IR model will return DB2 as the better one as term frequencies are higher in DB2
• Hence we need to effectively summarize relationships between keywords in databases
Contributions
1) Address the problem of selection of structured data sources for keyword based queries
2) Propose a method for summarizing relationships between keywords in a database
3) Define metrics to rank source databases given a keyword query based on keyword relationships
4) Evaluation of proposed summarization using real datasets
Measuring Strength of Relationships Between Keywords
• Strength of relationships between two keywords measured as a combination of two factors:
1) Proximity factor – Inverse of distance
2) Frequency factor, given a distance d – Number of combinations of exactly d+1 distinct tuples that can be joined in a sequence to get the two keywords in the end tuples
Modeling of an RDBMS
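• Let m = No. of distinct keywords in database DB
• Let n = Total no. of tuples in DB.
• Then D is the m × n matrix with one row per keyword k1, k2, …, km and one column per tuple t1, t2, …, tn:
$D = (d_{ij})_{m \times n}$, where $d_{ij} = 1$ if keyword $k_i$ appears in tuple $t_j$ and $d_{ij} = 0$ otherwise
• D represents presence or absence of a keyword in a tuple (Similar to term-document incidence matrix in VSM)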
Modeling of an RDBMS Cont’d
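• Matrix T represents relationship between tuples (for example, foreign key)
• T is an n × n matrix with one row and one column per tuple t1, t2, …, tn; $T[i,j] = 1$ if tuples $t_i$ and $t_j$ are related and 0 otherwise. For example, when t1 and t2 are related:
T =   t1  t2  …  tn
t1    0   1
t2    1   0
:
tn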
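The two matrices can be sketched for the DB1 example as follows. This is only an illustration: the helper names are hypothetical, keywords are matched by case-insensitive substring as a crude stand-in for real tokenization and stemming, and only the title and conference-name columns are indexed.

```python
def incidence_matrix(keywords, tuple_texts):
    """D[i][j] = 1 if keyword i occurs in tuple j (case-insensitive substring match)."""
    return [[1 if kw.lower() in text.lower() else 0 for text in tuple_texts]
            for kw in keywords]

def tuple_relationship_matrix(n, links):
    """T[i][j] = T[j][i] = 1 for every pair of directly related tuples (i, j)."""
    T = [[0] * n for _ in range(n)]
    for i, j in links:
        T[i][j] = T[j][i] = 1
    return T

# Tuples t1..t4 of DB1 (indices 0..3); text taken from the example tables.
tuple_texts = [
    "Historical Multimedia Databases",                 # t1
    "Connection Perspective Reform",                   # t2
    "The conference on Very Large Databases (VLDB)",   # t3
    "ACM Sigmod Conf on management of data",           # t4
]
keywords = ["multimedia", "database", "vldb"]
links = [(0, 2), (1, 3)]   # procID foreign keys: t1 -> t3 (23), t2 -> t4 (18)

D = incidence_matrix(keywords, tuple_texts)
T = tuple_relationship_matrix(len(tuple_texts), links)
```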
Mathematical representation of keyword relationships
1) δ: User supplied parameter denoting maximum number of allowed join operators
2) K: Maximum no. of results expected from the database. Enables a user to control the quality of the results
3) For each distance d ($0 \le d \le \delta$), $\omega_d(k_i, k_j)$ is the frequency of distance-d joining sequences to connect $k_i$ and $k_j$
Mathematical representation of keyword relationships Cont’d
• A Keyword Relationship Matrix (KRM) R represents the relationship between any pair of keywords with respect to δ and K
1) When $\sum_{d=0}^{\delta} \omega_d(k_i, k_j) \le K$,
$R[i,j] = r_{ij} = \sum_{d=0}^{\delta} \psi_d \cdot \omega_d(k_i, k_j)$, where $\psi_d = 1/(d+1)$
Mathematical representation of keyword relationships Cont’d
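2) When $\sum_{d=0}^{\delta} \omega_d(k_i, k_j) > K$, we have $\delta' \le \delta$ such that $\sum_{d=0}^{\delta'} \omega_d(k_i, k_j) \ge K$ and $\sum_{d=0}^{\delta'-1} \omega_d(k_i, k_j) < K$, and
$R[i,j] = r_{ij} = \sum_{d=0}^{\delta'-1} \psi_d \cdot \omega_d(k_i, k_j) + \psi_{\delta'} \cdot \left(K - \sum_{d=0}^{\delta'-1} \omega_d(k_i, k_j)\right)$, where $\psi_d = 1/(d+1)$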
Example
• For two given keywords k1 and k2, and K=40
• Database A has 5 joining sequences connecting them at distance = 1
Then score = 5 * (1/2) = 2.5
• Database B has 40 joining sequences connecting them at distance = 4
Then score = 40*(1/5) = 8
• Here B wins.
Example (cont’d)
• If we bring down K to 10, then A wins.
• Thus one may prefer A to B due to better quality.
• K defines the number of top results users expect from the database.
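The effect of K in this example can be checked with a short sketch of the scoring rule: joining sequences are counted from the shortest distance upward, each weighted by 1/(d+1), until K of them have been used. The function name and the choice δ = 4 are assumptions made for the illustration.

```python
def relationship_score(freq_by_distance, K, delta):
    """Score one keyword pair from its frequencies {d: omega_d}, capped at K sequences."""
    score = 0.0
    remaining = K                      # how many joining sequences may still be counted
    for d in range(delta + 1):
        take = min(freq_by_distance.get(d, 0), remaining)
        score += take / (d + 1)        # weight psi_d = 1 / (d + 1)
        remaining -= take
        if remaining == 0:
            break
    return score

db_a = {1: 5}     # 5 joining sequences at distance 1
db_b = {4: 40}    # 40 joining sequences at distance 4

scores_k40 = (relationship_score(db_a, 40, 4), relationship_score(db_b, 40, 4))
scores_k10 = (relationship_score(db_a, 10, 4), relationship_score(db_b, 10, 4))
```

With K = 40 the scores are 2.5 for A and 8.0 for B; with K = 10 only ten of B's distance-4 sequences count, giving 2.0 against A's 2.5.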
Computation of KRM
How to compute $\omega_d(k_i, k_j)$
Few definitions –
• distance-d tuple relationship matrix, denoted as $T_d$ (n × n), is a symmetric matrix with binary entries such that for any $1 \le i, j \le n$ and $i \ne j$,
1) $T_d[i,j] = T_d[j,i] = 1$ if and only if the shortest joining sequence to connect the two tuples $t_i$ and $t_j$ is of distance d, and
2) $T_d[i,j] = T_d[j,i] = 0$ otherwise
Three proven propositions aiding the computation of the KRM
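Proposition 1:-
For any i, j ($i \ne j$) and $d_1$, $d_2$ ($d_1 \ne d_2$), if $T_{d_1}[i,j] = 1$, then $T_{d_2}[i,j] = 0$
Proposition 2:-
Given $T_1 = T$, and supposing $T^*_d = \bigvee_{k=1}^{d} T_k$,
$T_{d+1}[i,j] = 0$ if $T^*_d[i,j] = 1$
$T_{d+1}[i,j] = 1$ if $T^*_d[i,j] = 0$ and $\exists r\,(1 \le r \le n)$, $T_d[i,r] \cdot T_1[r,j] = 1$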
Three proven propositions aiding the Computation of KRM Cont’d
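Proposition 3:-
Let $W_0 = D \times D^T$ ($D^T$ is transpose of D, where D is keyword incidence matrix)
1) We have $\forall i, j$, $1 \le i, j \le m$ and $i \ne j$, $\omega_0(k_i, k_j) = W_0[i,j]$
2) For $d \ge 1$, let $W_d = D \times T_d \times D^T$ (where $T_d$ is the tuple relationship matrix)
We have $\forall i, j$, $1 \le i, j \le m$ and $i \ne j$, $\omega_d(k_i, k_j) = W_d[i,j]$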
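Propositions 2 and 3 translate fairly directly into matrix code. The following is a minimal pure-Python sketch, with hypothetical function names, that derives each distance-d tuple matrix from the previous one and then forms W_0 = D × D^T and W_d = D × T_d × D^T. The literal D and T are the DB1 matrices for the keywords (multimedia, database, vldb) and tuples t1..t4.

```python
def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def distance_matrices(T, delta):
    """Return {d: T_d} for d = 1..delta, following Proposition 2."""
    n = len(T)
    if delta < 1:
        return {}
    Td = [row[:] for row in T]
    reached = [row[:] for row in T]          # T*_d: pairs already linked at distance <= d
    result = {1: Td}
    for d in range(1, delta):
        step = matmul(Td, T)                 # step[i][j] > 0 iff some r has T_d[i][r] * T_1[r][j] = 1
        Td = [[1 if i != j and reached[i][j] == 0 and step[i][j] > 0 else 0
               for j in range(n)] for i in range(n)]
        reached = [[reached[i][j] | Td[i][j] for j in range(n)] for i in range(n)]
        result[d + 1] = Td
    return result

def keyword_frequencies(D, T, delta):
    """Return {d: W_d}; off-diagonal W_d[i][j] is omega_d(k_i, k_j) (Proposition 3)."""
    Dt = transpose(D)
    W = {0: matmul(D, Dt)}
    for d, Td in distance_matrices(T, delta).items():
        W[d] = matmul(matmul(D, Td), Dt)
    return W

D = [[1, 0, 0, 0], [1, 0, 1, 0], [0, 0, 1, 0]]                   # rows: multimedia, database, vldb
T = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]     # t1-t3 and t2-t4 are linked
W = keyword_frequencies(D, T, delta=2)
```

Only the off-diagonal entries are meaningful here; the diagonal of each W_d pairs a keyword with itself and is ignored.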
Comparison of frequencies of keyword pairs in DB1 and DB2
Frequencies of keyword pairs in DB1
Keyword pair          d=0  d=1  d=2  d=3  d=4
database:multimedia   1    1    -    -    -
multimedia:VLDB       0    1    -    -    -
database:VLDB         1    1    -    -    -
Frequencies of keyword pairs in DB2
Keyword pair          d=0  d=1  d=2  d=3  d=4
database:multimedia   0    0    0    0    2
multimedia:VLDB       0    0    0    0    0
database:VLDB         0    0    1    0    0
Our query was Q = (Multimedia, Database, VLDB)
Observation tells us that query words are more closely related in DB1
Comparison of relationship scores of DB1 and DB2
Keyword pair DB1 DB2
Database:multimedia 1.5 0.4
Multimedia:VLDB 0.5 0
Database:VLDB 1.5 0.33
• Sample computation for DB1 (K=10)
Rel [ Database, multimedia ] = 1 * 1 + 0.5 * 1 = 1.5
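The DB2 column appears to follow from the same formula applied to the DB2 frequency table. The working below assumes K = 10 as in the DB1 sample (so no frequency is capped) and δ ≥ 4 (so the distance-4 sequences are counted):

```latex
\begin{align*}
\mathrm{Rel}[\text{Database}, \text{multimedia}] &= \tfrac{1}{5} \times 2 = 0.4 \\
\mathrm{Rel}[\text{Multimedia}, \text{VLDB}] &= 0 \\
\mathrm{Rel}[\text{Database}, \text{VLDB}] &= \tfrac{1}{3} \times 1 \approx 0.33
\end{align*}
```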
Implementation with SQL
• Relation RD(kId, tId) represents the non-zero entries of the keyword incidence matrix D
• kId is the keyword ID and tId is the tuple ID
• RK(kId, keyword) stores the keyword IDs and keywords (similar to a word dictionary in IR)
• Matrices T1, T2, T3... (Tuple relationship matrices) are represented with relations RT1,RT2 ,RT3..
• RT1 :- Produced by joining pairs of tables
• RT2 :- Produced by self-joining RT1
Implementation with SQL Cont’d
RT3 produced using the following SQLs
INSERT INTO RT3 (tId1, tId2)
SELECT s1.tId1, s2.tId2
FROM RT2 s1, RT1 s2
WHERE s1.tId2 = s2.tId1
INSERT INTO RT3 (tId1, tId2)
SELECT s1.tId1, s2.tId1
FROM RT2 s1, RT1 s2
WHERE s1.tId2 = s2.tId2 AND s1.tId1 < s2.tId1
INSERT INTO RT3 (tId1, tId2)
SELECT s2.tId1, s1.tId2
FROM RT2 s1, RT1 s2
WHERE s1.tId1 = s2.tId2
Implementation with SQL Cont’d
INSERT INTO RT3 (tId1, tId2)
SELECT s1.tId2, s2.tId2
FROM RT2 s1, RT1 s2
WHERE s1.tId1 = s2.tId1 AND s1.tId2 < s2.tId2
DELETE a FROM RT3 a, RT2 b, RT1 c
WHERE (a.tId1 = b.tId1 AND a.tId2 = b.tId2) OR
(a.tId1 = c.tId1 AND a.tId2 = c.tId2)
• In general, RTd is generated by joining RTd-1 with RT1
and excluding the tuples already in RTd-1, RTd-2, … RT1
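The general rule can be tried out with Python's built-in `sqlite3` module. This sketch is a simplification, not the slides' exact SQL: it stores every relationship in both directions, so a single join replaces the four cases above, and it uses `NOT EXISTS` instead of the multi-table `DELETE`. The function name `build_rt` and the four-tuple chain are hypothetical.

```python
import sqlite3

def build_rt(conn, edges, delta):
    """Create RT1..RT<delta>; RTd holds tuple pairs whose shortest join path has d edges."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE RT1 (tId1 INTEGER, tId2 INTEGER)")
    # Assumption: keep both directions of every direct relationship.
    both = [(a, b) for a, b in edges] + [(b, a) for a, b in edges]
    cur.executemany("INSERT INTO RT1 VALUES (?, ?)", both)
    for d in range(2, delta + 1):
        cur.execute(f"CREATE TABLE RT{d} (tId1 INTEGER, tId2 INTEGER)")
        # Pairs already connected at a shorter distance (RT1 .. RTd-1).
        seen = " UNION ALL ".join(f"SELECT tId1, tId2 FROM RT{k}" for k in range(1, d))
        cur.execute(f"""
            INSERT INTO RT{d} (tId1, tId2)
            SELECT DISTINCT s1.tId1, s2.tId2
            FROM RT{d - 1} s1 JOIN RT1 s2 ON s1.tId2 = s2.tId1
            WHERE s1.tId1 <> s2.tId2
              AND NOT EXISTS (SELECT 1 FROM ({seen}) p
                              WHERE p.tId1 = s1.tId1 AND p.tId2 = s2.tId2)
        """)
    conn.commit()

conn = sqlite3.connect(":memory:")
build_rt(conn, edges=[(1, 2), (2, 3), (3, 4)], delta=3)   # chain t1 - t2 - t3 - t4
rt2 = conn.execute("SELECT tId1, tId2 FROM RT2 WHERE tId1 < tId2 ORDER BY 1, 2").fetchall()
rt3 = conn.execute("SELECT tId1, tId2 FROM RT3 WHERE tId1 < tId2 ORDER BY 1, 2").fetchall()
```

On the chain, RT2 links tuples two joins apart and RT3 links only the two end tuples.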
Creation of W0,W1, W2….(Matrices representing frequencies)
• W0 is represented with a relation RW0(kId1, kId2, freq)
• tuple (kId1, kId2, freq) records the pair of keywords (kId1,kId2) (kId1 < kId2), and its frequency (freq) at 0 distance, where freq is greater than 0.
• RW0 is the result of self-joining RD (kId, tId).
• SQL for creating RW0
INSERT INTO RW0 (kId1, kId2, freq) SELECT s1.kId AS kId1, s2.kId AS kId2, count(*) FROM RD s1, RD s2 WHERE s1.tId = s2.tId AND s1.kId < s2.kId GROUP BY kId1, kId2
Creation of W0,W1, W2….(Matrices representing frequencies)
• SQL for creating RWd , d > 0
INSERT INTO RWd (kId1, kId2, freq) SELECT s1.kId AS kId1, s2.kId AS kId2, count(*)
FROM RD s1, RD s2, RTd r WHERE ((s1.tId = r.tId1 AND s2.tId = r.tId2) OR (s1.tId = r.tId2 AND s2.tId = r.tId1)) AND s1.kId < s2.kId GROUP BY kId1, kId2
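To see the two queries at work, the sketch below runs them in SQLite (via Python's `sqlite3`) on the DB1 example. The keyword IDs (1 = database, 2 = multimedia, 3 = vldb) and tuple IDs (1..4 for t1..t4) are assumptions made for the illustration, and the `GROUP BY` names the source columns rather than the aliases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE RD  (kId INTEGER, tId INTEGER);
CREATE TABLE RT1 (tId1 INTEGER, tId2 INTEGER);
CREATE TABLE RW0 (kId1 INTEGER, kId2 INTEGER, freq INTEGER);
CREATE TABLE RW1 (kId1 INTEGER, kId2 INTEGER, freq INTEGER);
""")
# Non-zero entries of D: t1 holds database and multimedia, t3 holds database and vldb.
cur.executemany("INSERT INTO RD VALUES (?, ?)", [(1, 1), (2, 1), (1, 3), (3, 3)])
# Direct tuple relationships: t1-t3 and t2-t4.
cur.executemany("INSERT INTO RT1 VALUES (?, ?)", [(1, 3), (2, 4)])

# Distance 0: both keywords occur in the same tuple.
cur.execute("""
INSERT INTO RW0 (kId1, kId2, freq)
SELECT s1.kId, s2.kId, count(*)
FROM RD s1, RD s2
WHERE s1.tId = s2.tId AND s1.kId < s2.kId
GROUP BY s1.kId, s2.kId
""")
# Distance 1: the keywords occur in two tuples linked in RT1 (either orientation).
cur.execute("""
INSERT INTO RW1 (kId1, kId2, freq)
SELECT s1.kId, s2.kId, count(*)
FROM RD s1, RD s2, RT1 r
WHERE ((s1.tId = r.tId1 AND s2.tId = r.tId2) OR (s1.tId = r.tId2 AND s2.tId = r.tId1))
  AND s1.kId < s2.kId
GROUP BY s1.kId, s2.kId
""")
rw0 = cur.execute("SELECT kId1, kId2, freq FROM RW0 ORDER BY kId1, kId2").fetchall()
rw1 = cur.execute("SELECT kId1, kId2, freq FROM RW1 ORDER BY kId1, kId2").fetchall()
```

The rows agree with the d=0 and d=1 columns of the DB1 frequency table; pairs with frequency 0 simply do not appear.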
Final resulting KRM
• The final resulting KRM, R is stored in a relation RR(kId1,kId2),consisting of pairs of keywords and their relationship score.
• It is computed using the formula –
$R[i,j] = \sum_{d=0}^{\delta} \psi_d \cdot \omega_d(k_i, k_j)$
• Update issues :-
The tables for storing these matrices can be updated dynamically.
Estimating multi-keyword relationships
• Multiple keywords are connected with Steiner trees.
• It is an NP complete problem to find a minimum Steiner tree.
• Most current keyword search algorithms rely on heuristics to find top-K results.
• Hence the relationship between multiple keywords is estimated using the derived keyword relationships described above.
Estimating multi-keyword relationships Cont’d
Proposition 4
1) Given a set of keywords $Q = \{k_1, k_2, k_3, \ldots, k_q\}$, the number of edges of the tuple tree $T_Q$ that contains all the keywords in Q is no less than
$\max_{1 \le i, j \le q,\ i \ne j} \{\, \min\{ d \mid d \ge 0 \ \&\ \omega_d(k_i, k_j) > 0 \} \,\}$
2) If a pair of keywords is not found in a KR summary, the no. of edges of the tuple tree containing all keywords must be greater than δ, so its score is set to 0 so that it can be safely pruned from selection.
Estimating multi-keyword relationships Cont’d
We can use four kinds of estimations of scores :-
1) $rel_{min}(Q, DB) = \min_{\{k_i, k_j\} \subseteq Q,\ i < j} rel(k_i, k_j)$
This is the most conservative estimation formula
2) $rel_{max}(Q, DB) = \max_{\{k_i, k_j\} \subseteq Q,\ i < j} rel(k_i, k_j)$
Estimating multi-keyword relationships Cont’d
3) $rel_{sum}(Q, DB) = \sum_{\{k_i, k_j\} \subseteq Q,\ i < j} rel(k_i, k_j)$
4) $rel_{prod}(Q, DB) = \prod_{\{k_i, k_j\} \subseteq Q,\ i < j} rel(k_i, k_j)$
This formula assumes the highest degree of intersection
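The four estimations differ only in how the pairwise scores are aggregated, as this sketch shows. It uses the DB1 and DB2 relationship scores from the comparison table; the function name `estimate`, the `how` parameter and the dictionary layout are hypothetical. A pair missing from a summary counts as 0.

```python
from itertools import combinations
from math import prod

def estimate(query, rel, how):
    """Aggregate pairwise scores rel[{k_i, k_j}] over all keyword pairs in the query."""
    scores = [rel.get(frozenset(pair), 0.0) for pair in combinations(query, 2)]
    aggregate = {"min": min, "max": max, "sum": sum, "prod": prod}[how]
    return aggregate(scores)

query = ("multimedia", "database", "vldb")
db1 = {
    frozenset({"database", "multimedia"}): 1.5,
    frozenset({"multimedia", "vldb"}): 0.5,
    frozenset({"database", "vldb"}): 1.5,
}
db2 = {
    frozenset({"database", "multimedia"}): 0.4,
    frozenset({"database", "vldb"}): 0.33,
    # multimedia:vldb is absent from DB2's summary, so it counts as 0
}
db1_scores = {how: estimate(query, db1, how) for how in ("min", "max", "sum", "prod")}
db2_scores = {how: estimate(query, db2, how) for how in ("min", "max", "sum", "prod")}
```

Because one pair is missing in DB2, MIN and PROD both drop to 0 there, while MAX and SUM still give it a non-zero score.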
Database ranking and indexing
• With KR summary, we can effectively rank a set of databases D = {DB1,DB2,…,DBN} for a given keyword query.
• We can use either a global index or a local index
• Global Index –
1. Analogous to an inverted index in IR. Use keyword pairs as key, and <database Id, relationship score> as a postings entry
2. To evaluate a query, fetch the corresponding inverted lists, and compute the score for each database.
$rank(DB_1) < rank(DB_2) \Leftrightarrow rel(Q, DB_1) > rel(Q, DB_2)$
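A global index of this kind might be sketched as a dictionary from keyword pairs to postings lists. The function names are hypothetical, and the per-database score here uses the SUM estimation, which is one of the four options rather than something the slides prescribe for the index.

```python
from collections import defaultdict
from itertools import combinations

def build_global_index(summaries):
    """Map each keyword pair to a postings list of (database id, relationship score)."""
    index = defaultdict(list)
    for db_id, rel in summaries.items():
        for pair, score in rel.items():
            index[frozenset(pair)].append((db_id, score))
    return index

def rank_databases(query, index):
    """Fetch the postings of every query keyword pair and rank databases by summed score."""
    totals = defaultdict(float)
    for pair in combinations(query, 2):
        for db_id, score in index.get(frozenset(pair), []):
            totals[db_id] += score
    return sorted(totals, key=totals.get, reverse=True)

# Relationship scores from the DB1/DB2 comparison table (zero entries omitted).
summaries = {
    "DB1": {("database", "multimedia"): 1.5, ("multimedia", "vldb"): 0.5, ("database", "vldb"): 1.5},
    "DB2": {("database", "multimedia"): 0.4, ("database", "vldb"): 0.33},
}
index = build_global_index(summaries)
ranking = rank_databases(("multimedia", "database", "vldb"), index)
```

For the running query this ranks DB1 ahead of DB2, the opposite of the term-frequency ordering.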
Database ranking and indexing Cont’d
• Decentralized index
1. Each machine can store a subset of the index (that is, keyword pairs and inverted lists)
2. When a query is received at a node, search messages are sent across nodes and the corresponding postings lists are retrieved.
Experiments done to evaluate efficiency of this system
• K-R score compared with score from brute force method (real_rank) over 82 databases spread across 16 nodes.
• Effectiveness of this technique has been successfully established over distributed databases
Definitions used for comparison :-
1) $real\_rank(DB_i) < real\_rank(DB_j) \Leftrightarrow real\_score(Q, DB_i) > real\_score(Q, DB_j)$,
where real_score is defined as $\sum_{i=1}^{k} Score(T_i, Q)$,
where $T_i$ = ith top result given query Q,
and $Score(T_i, Q)$ measures relevance of $T_i$ to Q
Experiments done to evaluate efficiency of this system
2) $recall(l) = \sum_{DB \in Top_l(S)} Score(Q, DB) \,/\, \sum_{DB \in Top_l(R)} Score(Q, DB)$
where S and R denote summary based and real rankings respectively,
and Score(Q, DB) is the real score of the database
(In IR, recall = (Number of relevant retrieved) / (Number of relevant))
3) $precision(l) = |\{ DB \in Top_l(S) \mid Score(Q, DB) > 0 \}| \,/\, |Top_l(R)|$
(Number of relevant / Number of retrieved)
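The two measures can be sketched as follows, reading S as the summary-based ranking, R as the real ranking and Score(Q, DB) as the real score. The function names, database names, scores and rankings are all hypothetical and only illustrate the arithmetic.

```python
def recall(S, R, real_score, l):
    """Real score captured by the top-l of S, relative to the top-l of R."""
    best = sum(real_score[db] for db in R[:l])
    if best == 0:
        return 0.0
    return sum(real_score[db] for db in S[:l]) / best

def precision(S, R, real_score, l):
    """Share of the top-l positions filled by databases with a non-zero real score."""
    size = len(R[:l])
    if size == 0:
        return 0.0
    return sum(1 for db in S[:l] if real_score[db] > 0) / size

# Hypothetical real scores and rankings for four databases.
real_score = {"A": 5.0, "B": 3.0, "C": 0.0, "D": 1.0}
R = ["A", "B", "D", "C"]   # ranking by real score
S = ["A", "C", "B", "D"]   # ranking produced from the summaries

recall_at_2 = recall(S, R, real_score, 2)
precision_at_2 = precision(S, R, real_score, 2)
```

Here the summary ranking places an irrelevant database second, so at l = 2 it captures 5 of the best achievable 8 score units and half of its selections are useful.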
Experiments done to evaluate efficiency of this system Cont’d
• Effects of δ (length of joining sequence)
1) Selection performance of keyword queries generally gets better when δ grows larger.
2) Precision and recall values for different δ values tend to cluster into groups
3) There are big gaps in both precision and recall values when 0 ≤ δ ≤ 1 and when δ is greater
Experiments done to evaluate efficiency of this system Cont’d
Recall and precision of 2-keyword queries using KR summaries and KF-summaries
Experiments done to evaluate efficiency of this system Cont’d
• Effects of number of query keywords –
1) Performance of 2-keyword queries generally better than 3-keyword and 4-keyword queries. 5-keyword queries give better recall than 3 and 4 keyword queries as they are more selective
2) Generally, the difference in the recall of queries with different no. of keywords is less than that of the precision. This shows that the system is effective in assigning high ranks to useful databases, although less relevant or irrelevant databases may also be selected.
Comparison of four kinds of estimations (MIN, MAX, SUM, PROD)
• SUM and PROD have similar behavior and outperform the other two methods
• Hence it is more effective to take into account relationship information of every keyword pair in the query when estimating overall scores
Experiments done to evaluate efficiency of this system Cont’d
Recall and precision of K-R summaries using different estimations (δ = 3)
Experiments done to evaluate efficiency of this system Cont’d