Dynamic Faceted Search for Discovery-driven Analysis
Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman
CIKM’08
Speaker: Li, Huei-Jyun
Advisor: Dr. Koh, Jia-Ling
Date: 2008/12/18
1
Outline
Introduction
Terminology and Problem Statement
Measure of “Interestingness”
Implementing Dynamic Faceted Search
Evaluation
Conclusion and Future Work
2
Introduction
Today’s faceted search systems are designed for browsing catalog data and are not directly suitable for discovery-driven exploration
To preserve browsing consistency, facets selected for navigation tend to be “static”
When browsing online catalogs, the navigational facets are single-dimensional only
3
Introduction
Propose a dynamic faceted search system for the kind of discovery-driven analysis that is often performed in On-Line Analytical Processing (OLAP) systems
From a potentially large search result, automatically and dynamically discover a small set of facets and values that are deemed most “interesting” to a user
4
Terminology and Problem Statement
Defn 1. A repository D is a collection of documents, each of which is composed of some free text and one or more <facet: value> pairs
Given a value f in facet F, we call <F : f> an instance of F
All unique values associated with a facet F form the domain of F
5
Terminology and Problem Statement
Defn 2. Organize the domain of these facets into a facet hierarchy
Each node in the hierarchy stores a <facet: value> pair
A node <F1: f1> is the parent of another node <F2: f2> if for each document, F2 = f2 implies F1 = f1
6
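The parent condition in Defn 2 can be checked directly against a set of documents. The following is a minimal sketch; the function name `is_parent` and the toy documents are illustrative assumptions, not part of the slides.

```python
def is_parent(docs, parent, child):
    """Return True if, for every document, child facet = value implies parent facet = value."""
    (F1, f1), (F2, f2) = parent, child
    return all(
        doc.get(F1) == f1
        for doc in docs
        if doc.get(F2) == f2
    )


# Hypothetical documents, each reduced to its <facet: value> pairs.
DOCS = [
    {"decade": "2000s", "year": "2005", "venue": "VLDB"},
    {"decade": "2000s", "year": "2006", "venue": "SIGMOD"},
    {"decade": "1990s", "year": "1998", "venue": "VLDB"},
]

# <decade: 2000s> is a parent of <year: 2005>: every 2005 document is in the 2000s.
year_under_decade = is_parent(DOCS, ("decade", "2000s"), ("year", "2005"))
# <decade: 2000s> is not a parent of <venue: VLDB>: one VLDB document is from the 1990s.
venue_under_decade = is_parent(DOCS, ("decade", "2000s"), ("venue", "VLDB"))
```

Note that the check is vacuously true when no document carries the child value, so a real system would probably also require the child value to occur at least once.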
Terminology and Problem Statement
Defn 3. Assume a query q on the repository has the form “keywords && F1 = f1 && F2 = f2…”
The result of q is denoted by Dq: the set of documents that contain the specified keywords and satisfy all constraints on the selected facets
7
Terminology and Problem Statement
Defn 4. Given a query q, define a facet summary for a facet set F1, …, Fm as a list of tuples <f1, …, fm, A(f1, …, fm)> over Dq
fi is an instance of facet Fi
A(f1, …, fm) is an aggregate of documents in Dq that contain all these facet instances
8
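Defns 1, 3 and 4 can be tied together in a short Python sketch. The toy repository, the helper names `run_query` and `facet_summary`, whole-word keyword matching, and the choice of count as the aggregate A are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy repository: each document has free text and <facet: value> pairs.
REPOSITORY = [
    {"text": "distributed query processing", "facets": {"venue": "VLDB", "year": "2005"}},
    {"text": "mining data streams", "facets": {"venue": "SIGMOD", "year": "2005"}},
    {"text": "distributed data mining", "facets": {"venue": "VLDB", "year": "2006"}},
    {"text": "distributed transactions", "facets": {"venue": "VLDB", "year": "2005"}},
]


def run_query(docs, keywords, constraints):
    """Return Dq: documents containing all keywords and matching every facet constraint."""
    result = []
    for doc in docs:
        words = doc["text"].split()
        has_keywords = all(k in words for k in keywords)
        meets_constraints = all(doc["facets"].get(F) == f for F, f in constraints.items())
        if has_keywords and meets_constraints:
            result.append(doc)
    return result


def facet_summary(dq, facet_set):
    """Return sorted tuples <f1, ..., fm, A(f1, ..., fm)> over Dq, with A = count."""
    counts = Counter(
        tuple(doc["facets"][F] for F in facet_set)
        for doc in dq
        if all(F in doc["facets"] for F in facet_set)
    )
    return sorted(values + (count,) for values, count in counts.items())


# Query of the form "keywords && F1 = f1": keyword "distributed" && venue = VLDB.
dq = run_query(REPOSITORY, ["distributed"], {"venue": "VLDB"})
summary = facet_summary(dq, ["year"])
```

Passing a facet set with more than one facet, such as `["venue", "year"]`, yields the multi-dimensional tuples that the definition allows.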
Terminology and Problem Statement
Problem Definition:
Given a repository of documents with n facets, a query q, and two integers K1 and K2, select K1 facet sets and a facet summary for each with up to K2 tuples that are the most “interesting” to a user
9
Measure of “Interestingness”
Interestingness: How surprising an actual aggregated value is, given a certain expectation
10
Measure of “Interestingness”: Setting the Expectation
For a given set of facet values f1, …, fm from F1, …, Fm:
CD(f1, …, fm): the number of documents with all those facet values in D
Cq(f1, …, fm): the number of documents with all those facet values in Dq
E[Cq(f1, …, fm)]: an “expected” value for Cq(f1, …, fm)
Ways to set the expectation: natural, navigational, ad hoc
11
Measure of “Interestingness”: Setting the Expectation
Natural:
For an individual facet instance <F : f>: (uniformity assumption)
For an instance f1, …, fm of a facet set: (independence assumption)
12
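One plausible reading of the two assumptions named for the natural expectation, sketched in Python: under uniformity, a facet value is expected to occur in Dq at the same rate as in D; under independence, the rates of the individual facet values multiply. The function names and the numbers are hypothetical, and the exact formulas used by the authors may differ.

```python
def expected_single(count_in_D, size_D, size_Dq):
    """Uniformity assumption: the value occurs in Dq at the same rate as in D."""
    return count_in_D * size_Dq / size_D


def expected_joint(counts_in_D, size_D, size_Dq):
    """Independence assumption: the per-facet occurrence rates multiply."""
    rate = 1.0
    for count in counts_in_D:
        rate *= count / size_D
    return size_Dq * rate


# A value found in 200 of 1000 documents, with a query result of 50 documents.
e_single = expected_single(count_in_D=200, size_D=1000, size_Dq=50)
# Two facet values found in 200 and 500 of 1000 documents respectively.
e_joint = expected_joint([200, 500], size_D=1000, size_Dq=50)
```

Under these assumptions `e_single` is 10 documents and `e_joint` is 5 documents; an observed count far from these values would be a candidate for being "interesting".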
Measure of “Interestingness”: Setting the Expectation
Navigational:
Ad hoc:
The user can tell the system to set the expectation based on an arbitrary query q of the user’s choice
The expected count of each facet value is set proportionally, based on its distribution in the result of q
13
Measure of “Interestingness”: Measuring Degree of Interestingness
Single facet instance:
Evaluate it with respect to a scenario in which its associated count is generated by random sampling
The smaller the probability of observing the count under random sampling, the more interesting the facet instance
14
Measure of “Interestingness”: Measuring Degree of Interestingness
p-value:
Suppose that a certain facet value occurs in r out of R documents in the repository and in q out of Q documents in the output of a certain query
Also suppose
The interestingness of that facet value vis-à-vis the query: the probability that in a random sample of size Q there will be at least q documents with that facet value
This probability follows the hypergeometric distribution and can be approximated by the normal distribution or the Poisson distribution
15
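The p-value described on this slide is the upper tail of a hypergeometric distribution and can be computed exactly for small inputs. The sketch below uses only the standard library; the function name `p_value` is hypothetical, and the example assumes the over-represented case, in which the facet value is more frequent in the query result than in the repository.

```python
from math import comb


def p_value(r, R, q, Q):
    """P[X >= q] for X ~ Hypergeometric(population R, r with the facet value, sample size Q)."""
    total = comb(R, Q)                      # number of possible samples of size Q
    upper = min(r, Q)                       # X cannot exceed either bound
    tail = sum(comb(r, x) * comb(R - r, Q - x) for x in range(q, upper + 1))
    return tail / total


# Facet value occurs in 3 of 10 repository documents and in 2 of 4 result documents.
p = p_value(r=3, R=10, q=2, Q=4)
```

Here `p` is 70/210, about 0.33, so the observation is not surprising. For large repositories the exact sum becomes expensive, which is where the normal or Poisson approximation is useful.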
Measure of “Interestingness”: Measuring Degree of Interestingness
The whole facet:
For each facet F, we consider the p-values of only the k most interesting values in F , replace
The final measure:
MaxWeight: assign 1 to w1 and 0 to the rest
AvgWeight: assign each wi an equal weight
HybridWeight: average the interestingness computed by MaxWeight and AvgWeight
16
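The three weighting schemes can be sketched as follows. This is an assumption-laden illustration: it takes the final measure to be a weighted sum of per-value interestingness scores sorted from most to least interesting, and it takes those scores as given (any transform of the p-values in which a smaller p-value yields a larger score). The function names are hypothetical.

```python
def weighted_sum(scores, weights):
    """Combine per-value scores with weights w1, ..., wk."""
    return sum(w * s for w, s in zip(weights, scores))


def facet_interestingness(value_scores, k, scheme="hybrid"):
    """Aggregate the k most interesting value scores of one facet."""
    top = sorted(value_scores, reverse=True)[:k]   # most interesting first
    if not top:
        return 0.0
    n = len(top)
    max_weight = weighted_sum(top, [1.0] + [0.0] * (n - 1))   # w1 = 1, rest 0
    avg_weight = weighted_sum(top, [1.0 / n] * n)             # equal weights
    if scheme == "max":
        return max_weight
    if scheme == "avg":
        return avg_weight
    if scheme == "hybrid":
        return (max_weight + avg_weight) / 2
    raise ValueError(f"unknown scheme: {scheme}")


scores = [4.0, 1.0, 3.0, 2.0]          # hypothetical per-value scores for one facet
by_max = facet_interestingness(scores, k=3, scheme="max")
by_avg = facet_interestingness(scores, k=3, scheme="avg")
by_hybrid = facet_interestingness(scores, k=3, scheme="hybrid")
```

MaxWeight rewards a facet with one outstanding value, AvgWeight rewards a facet whose top values are all fairly interesting, and HybridWeight sits between the two.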
Implementing Dynamic Faceted Search
Solr: indexes facets without storing them
Enumerates every facet instance <F: f> from the index and intersects its posting list with Dq
From the intersected set, it derives the count on facet value f
Caches each posting list to a bitset
If the bitset is dense: bitmap
Otherwise: a hash map of document IDs
17
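The counting strategy on this slide can be illustrated with a small Python sketch. It does not use or reproduce Solr's actual classes: Python integers stand in for bitmaps, sets stand in for the hashed representation, and the density threshold and all names are hypothetical.

```python
DENSITY_THRESHOLD = 0.25  # hypothetical cut-off between the two representations


def cache_posting_list(doc_ids, num_docs):
    """Dense posting lists become an int bitmap; sparse ones become a set of doc IDs."""
    if len(doc_ids) / num_docs >= DENSITY_THRESHOLD:
        bitmap = 0
        for d in doc_ids:
            bitmap |= 1 << d
        return bitmap
    return set(doc_ids)


def facet_count(cached, dq_ids):
    """Count the documents of Dq that carry the facet value."""
    if isinstance(cached, int):                       # bitmap: test one bit per Dq document
        return sum(1 for d in dq_ids if (cached >> d) & 1)
    return len(cached & dq_ids)                       # set: plain intersection


def facet_counts(index, dq_ids, num_docs):
    """Enumerate every facet instance and intersect its posting list with Dq."""
    return {
        instance: facet_count(cache_posting_list(doc_ids, num_docs), dq_ids)
        for instance, doc_ids in index.items()
    }


INDEX = {
    ("venue", "VLDB"): [0, 2, 3, 5],
    ("venue", "SIGMOD"): [1],
    ("year", "2005"): [0, 1, 3],
}
counts = facet_counts(INDEX, dq_ids={0, 2, 3}, num_docs=8)
```

Every facet instance is visited once per query, including `("venue", "SIGMOD")`, whose count turns out to be zero.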
Implementing Dynamic Faceted Search
Improving Solr:
Solr limitation 1: has to choose a threshold that decides the representation of the bitset
Solution: represent a bitset as a compressed bitmap using Word-Aligned Hybrid (WAH) code
18
Implementing Dynamic Faceted Search
WAH
There are 2 types of words:
Literal words: a verbatim representation of 31 bits
Fill words: encode the length of a run of all 0’s or all 1’s in 30 bits
A bitmap is first broken into groups of 31 bits and then converted into a sequence of literal and fill words
Operations on bitmaps such as intersection can be performed on WAH code directly without decoding
19
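A minimal encoder and decoder make the two word types concrete. This sketch assumes 32-bit words in which the top bit distinguishes fill words from literal words and the next bit of a fill word gives the fill value; it pads the last group with zeros and keeps the original length separately, which is a simplification. It does not show intersection on the compressed form.

```python
GROUP = 31
LITERAL_MASK = (1 << GROUP) - 1   # 31 one-bits
FILL_FLAG = 1 << 31               # top bit set: fill word
FILL_BIT = 1 << 30                # fill value (run of 0's or of 1's)
MAX_RUN = (1 << 30) - 1           # run length, counted in 31-bit groups


def wah_encode(bits):
    """Encode a list of 0/1 values into a list of 32-bit WAH words."""
    words = []
    for start in range(0, len(bits), GROUP):
        chunk = bits[start:start + GROUP]
        chunk = chunk + [0] * (GROUP - len(chunk))   # pad the last group with zeros
        value = 0
        for b in chunk:
            value = (value << 1) | b
        if value in (0, LITERAL_MASK):               # uniform group: use a fill word
            fill_bit = FILL_BIT if value else 0
            last = words[-1] if words else 0
            same_fill = (last & FILL_FLAG) and (last & FILL_BIT) == fill_bit
            if same_fill and (last & MAX_RUN) < MAX_RUN:
                words[-1] = last + 1                 # extend the current run by one group
            else:
                words.append(FILL_FLAG | fill_bit | 1)
        else:
            words.append(value)                      # literal word: top bit stays 0
    return words


def wah_decode(words, length):
    """Expand WAH words back into the first `length` bits."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            bit = 1 if w & FILL_BIT else 0
            bits.extend([bit] * (GROUP * (w & MAX_RUN)))
        else:
            bits.extend((w >> shift) & 1 for shift in range(GROUP - 1, -1, -1))
    return bits[:length]


# Two all-zero groups, one mixed group, three all-one groups: 186 bits in 3 words.
bitmap = [0] * 62 + [1, 0, 1] + [0] * 28 + [1] * 93
encoded = wah_encode(bitmap)
```

The 186-bit example compresses to three words: a 0-fill of length 2, one literal, and a 1-fill of length 3.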
Implementing Dynamic Faceted Search
Improving Solr:
Solr limitation 2: it has to intersect the matching document set Dq with the bitset of every facet instance
Solution: reduce the number of intersections by building a directory structure called a bitset tree on top of the bitsets of a facet
20
Implementing Dynamic Faceted Search
Building and Using a Bitset Tree
Starting with the leaf nodes, for each bitset b corresponding to facet instance <F: f>, we create an entry <b, null>
Then divide all entries into groups of size s
For each group, we generate a leaf node holding all entries in that group
21
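The leaf construction above can be extended into a full tree in a few lines. The slide only describes the leaf level, so the upper levels here are an assumption: each parent entry is taken to hold the union of the bitsets in its child node plus a pointer to that child, which lets a query skip a whole subtree when the union does not intersect Dq. Python integers stand in for bitsets and all names are hypothetical.

```python
def build_bitset_tree(facet_bitsets, s):
    """facet_bitsets: dict of facet value -> int bitmap. Returns the root node (a list of entries)."""
    if s < 2:
        raise ValueError("group size s must be at least 2")
    # Leaf entries <b, null>, stored here as (bitset, child, facet value).
    entries = [(b, None, value) for value, b in facet_bitsets.items()]
    while True:
        nodes = [entries[i:i + s] for i in range(0, len(entries), s)]
        if len(nodes) <= 1:
            return nodes[0] if nodes else []
        # Assumed: a parent entry holds the union of the bitsets in its child node.
        entries = []
        for node in nodes:
            union = 0
            for b, _, _ in node:
                union |= b
            entries.append((union, node, None))


def count_facet_values(root, dq, stats):
    """Count matches per facet value, skipping subtrees whose union misses Dq."""
    counts = {}
    stack = [root]
    while stack:
        node = stack.pop()
        for bitset, child, value in node:
            stats["intersections"] += 1
            hit = bitset & dq
            if hit == 0:
                continue                      # nothing below this entry can match
            if child is None:
                counts[value] = bin(hit).count("1")
            else:
                stack.append(child)
    return counts


# Nine facet values, each held by exactly one document; Dq contains documents 0 and 1.
bitsets = {f"v{i}": 1 << i for i in range(9)}
tree = build_bitset_tree(bitsets, s=3)
stats = {"intersections": 0}
counts = count_facet_values(tree, dq=0b11, stats=stats)
```

In this example the tree needs 6 intersections instead of the 9 that a flat scan over every facet instance would perform, because two of the three leaf nodes are never visited.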
Evaluation: Setup
DBLP
Contains about 13,000 papers published in 26 venues (e.g., SIGMOD, VLDB, TODS) in the past 30 years
It has 14 facets organized in 6 hierarchies, including author, venue, time (e.g., decade, year), location (e.g., country, city), number of authors per paper, number of citations per paper
Uses the title of each paper as text for keyword searches
Used to conduct the user survey
22
Evaluation: Setup
Patent
Has about 1.8 million U.S. patents from the past 30 years
16 facets organized into 10 hierarchies
Used for performance evaluation
23
Evaluation: Result from a User Survey
Performed tests on 3 keyword queries
2 are provided by the authors: “distributed”, “mining”
Users pick the 3rd keyword
1 based on natural
2 based on navigational
1 used complete repository
1 used previous query
24
Evaluation: Result from a User Survey
25
Evaluation: Result from a User Survey
Our dynamic approach also received some negative feedback
Overall, the feedback for the natural expectation is neutral
Different ways of aggregating the degree of interestingness: HybridWeight (7) > MaxWeight (6) > AvgWeight (2)
26
Evaluation: Performance Results
Environment:
Implemented in Java
3GHz P4 desktop machine with 1GB memory
A single disk drive, running Linux
Versions:
1. simple: inverted index
2. Solr
3. compressed: improves Solr by WAH code
4. tree: improves Solr by bitset trees
5. compressed-tree: both WAH and bitset tree on Solr
27
Evaluation: Performance Results
Scaling with Data Size
Run a query that matches 25,000 docs using tree
Break the total time into search time and summary computation time
28
Evaluation: Performance Results
29
Evaluation: Performance Results
30
Conclusion and Future Work
Develop a novel dynamic faceted search system that supports OLAP-style discovery-driven analysis on a large set of structured and unstructured data
Propose an intuitive and effective way of measuring “interestingness”
Propose a novel navigational method of setting a user’s expectation
31
Conclusion and Future Work
Incorporate user feedback in facet selection
How to extend the aggregates to functions other than count, e.g., sum or average on some numerical measures
How to support dynamic faceted search in a distributed environment
32