Dynamic Faceted Search for Discovery-driven Analysis


Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman

CIKM’08

Speaker: Li, Huei-Jyun

Advisor: Dr. Koh, Jia-Ling

Date: 2008/12/18

Outline

Introduction
Terminology and Problem Statement
Measure of “Interestingness”
Implementing Dynamic Faceted Search
Evaluation
Conclusion and Future Work

Introduction

Today’s faceted search systems are designed for browsing catalog data and are not directly suitable for discovery-driven exploration.

To preserve browsing consistency, the facets selected for navigation tend to be “static”.

When browsing online catalogs, the navigational facets are single-dimensional only.

Introduction

The paper proposes a dynamic faceted search system for the kind of discovery-driven analysis that is often performed in On-Line Analytical Processing (OLAP) systems.

From a potentially large search result, the system automatically and dynamically discovers a small set of facets and values that are deemed most “interesting” to a user.

Terminology and Problem Statement

Defn 1. A repository D is a collection of documents, each of which is composed of some free text and one or more <facet: value> pairs.

Given a value f in facet F, we call <F: f> an instance of F.

All unique values associated with a facet F form the domain of F.

Defn 2. The domains of these facets are organized into a facet hierarchy.

Each node in the hierarchy stores a <facet: value> pair.

A node <F1: f1> is the parent of another node <F2: f2> if, for each document, F2 = f2 implies F1 = f1.

Defn 3. Assume a query q on the repository has the form “keywords && F1 = f1 && F2 = f2 …”.

The result of q, denoted by Dq, is the set of documents that contain the specified keywords and satisfy all constraints on the selected facets.

Defn 4. Given a query q, a facet summary for a facet set F1, …, Fm is a list of tuples <f1, …, fm, A(f1, …, fm)> over Dq, where

fi is an instance of facet Fi, and

A(f1, …, fm) is an aggregate of the documents in Dq that contain all these facet instances.
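Definitions 1, 3, and 4 can be made concrete with a small sketch. It assumes documents are plain dictionaries with a `text` field and a `facets` mapping, that each facet holds a single value per document, and that the aggregate A is a count; the function names `run_query` and `facet_summary` are hypothetical and do not come from the paper.

```python
from collections import Counter


def run_query(docs, keywords, constraints):
    """Return Dq: documents containing every keyword and matching every
    <facet: value> constraint (assumed document layout, see lead-in)."""
    result = []
    for doc in docs:
        words = doc["text"].lower().split()
        has_keywords = all(k.lower() in words for k in keywords)
        meets_facets = all(doc["facets"].get(F) == f for F, f in constraints.items())
        if has_keywords and meets_facets:
            result.append(doc)
    return result


def facet_summary(dq, facet_set):
    """Return tuples (f1, ..., fm, count) over Dq, most frequent first.

    The aggregate A is assumed to be a document count.
    """
    counts = Counter()
    for doc in dq:
        if all(F in doc["facets"] for F in facet_set):
            counts[tuple(doc["facets"][F] for F in facet_set)] += 1
    return [(*values, n) for values, n in counts.most_common()]
```

Calling `facet_summary(dq, ["venue", "year"])` would give a two-dimensional summary, which is the multi-dimensional case that static catalog facets do not cover.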

Problem Definition: Given a repository of documents with n facets, a query q, and two integers K1 and K2, select K1 facet sets and, for each, a facet summary with up to K2 tuples that are the most “interesting” to a user.

Measure of “Interestingness”

Interestingness: How surprising an actual aggregated value is, given a certain expectation

Measure of “Interestingness”: Setting the Expectation

For a given set of facet values f1, …, fm from F1, …, Fm:

CD(f1, …, fm): the number of documents in D with all those facet values.

Cq(f1, …, fm): the number of documents in Dq with all those facet values.

E[Cq(f1, …, fm)]: an “expected” value for Cq(f1, …, fm), which can be set in three ways: natural, navigational, or ad hoc.

Natural:

For an individual facet instance <F: f>, the expectation is set under a uniformity assumption.

For an instance f1, …, fm of a facet set, the expectation is set under an independence assumption.

Navigational:

Ad hoc: The user can tell the system to set the expectation based on an arbitrary query q of the user’s choice.

The count for each facet value is then set proportionally, based on the distribution of the result of q.
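The slide's formulas for these expectations were not preserved, so the following is only one plausible reading, not the authors' definition. It assumes "proportional" means scaling a facet value's count in a reference set (the whole repository D for the natural case, the result of a chosen query for the ad hoc case) by the relative size of Dq, and it assumes the independence case multiplies the per-facet fractions observed in Dq. Both function names are hypothetical.

```python
def expected_count(ref_count, ref_size, dq_size):
    """Proportional expectation (assumed form).

    ref_count: documents with the facet value in the reference set
    ref_size:  size of the reference set (D, or the result of a chosen query)
    dq_size:   size of the current result Dq
    """
    return ref_count * dq_size / ref_size


def expected_count_independent(marginal_counts, dq_size):
    """Expectation for a facet-value combination (assumed form).

    Treats the facets as independent: multiplies the fraction of Dq that
    carries each individual value, then scales back to a count.
    """
    expected = float(dq_size)
    for count in marginal_counts:
        expected *= count / dq_size
    return expected
```

For example, a value found in 100 of 1,000 repository documents would be expected in 5 documents of a 50-document result under this reading.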

Measure of “Interestingness”: Measuring Degree of Interestingness

Single facet instance: Evaluate it with respect to a scenario in which its associated count is generated by random sampling.

The smaller the probability of observing the count under random sampling, the more interesting the facet instance.

p-value: Suppose that a certain facet value occurs in r out of R documents in the repository and in q out of Q documents in the output of a certain query.

Also suppose …

The interestingness of that facet value vis-à-vis the query is the probability that a random sample of size Q contains at least q documents with that facet value.

This probability follows the hypergeometric distribution, which can be approximated by the normal distribution or the Poisson distribution.
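The p-value described above is the upper tail of a hypergeometric distribution, which can be computed exactly for small inputs. The sketch below uses the slide's symbols r, R, q, Q; the function name `hypergeom_p_value` is hypothetical, and exact summation is chosen here for clarity, where a real system would more likely use an approximation.

```python
from fractions import Fraction
from math import comb


def hypergeom_p_value(R, r, Q, q):
    """P(at least q of Q sampled documents carry the facet value).

    R: documents in the repository
    r: repository documents with the facet value
    Q: documents in the query result (the sample size)
    q: result documents with the facet value
    """
    total = comb(R, Q)       # all ways to draw Q documents from R
    upper = min(r, Q)        # largest possible number of hits
    tail = sum(comb(r, k) * comb(R - r, Q - k) for k in range(q, upper + 1))
    return Fraction(tail, total)
```

A smaller return value means the observed count is less likely under random sampling, and hence the facet value is more interesting.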

The whole facet: For each facet F, we consider the p-values of only the k most interesting values in F, replace …

The final measure is a weighted combination with weights wi:

MaxWeight: assign 1 to w1 and 0 to the rest.

AvgWeight: assign each wi an equal weight.

HybridWeight: average the interestingness computed by MaxWeight and AvgWeight.
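The three weighting schemes can be sketched as follows. Because the slide's exact formula is missing, the sketch assumes each facet value already has a numeric score where larger means more interesting (for instance, some transform of a small p-value), and that the final measure is a weighted sum over the k best scores. The function name and the `scheme` parameter are hypothetical.

```python
def facet_interestingness(scores, k, scheme="hybrid"):
    """Aggregate per-value scores of one facet into a single number.

    scores: per-value scores, larger = more interesting (assumed convention)
    k:      number of top values to consider
    scheme: "max", "avg", or "hybrid"
    """
    top = sorted(scores, reverse=True)[:k]
    if not top:
        return 0.0
    max_weight = top[0]                 # w1 = 1, all other weights 0
    avg_weight = sum(top) / len(top)    # equal weight on each of the top values
    if scheme == "max":
        return max_weight
    if scheme == "avg":
        return avg_weight
    if scheme == "hybrid":
        return (max_weight + avg_weight) / 2
    raise ValueError(f"unknown scheme: {scheme}")
```

MaxWeight rewards a facet with one outstanding value, AvgWeight rewards a facet whose top values are consistently notable, and HybridWeight sits between the two.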

Implementing Dynamic Faceted Search

Solr: indexes facets without storing them.

It enumerates every facet instance <F: f> from the index and intersects its posting list with Dq.

From the intersected set, it derives the count for facet value f.

It caches each posting list as a bitset: a bitmap if the bitset is dense, otherwise a hash map of document IDs.
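The caching and counting steps above can be illustrated with a simplified model. This is not Solr's code or API: it uses a Python integer as the bitmap, a `frozenset` as a stand-in for the hash structure, and a hypothetical `dense_threshold` parameter for the density cutoff.

```python
def cache_posting_list(doc_ids, n_docs, dense_threshold=0.05):
    """Pick a representation for one facet instance's posting list.

    dense_threshold is a hypothetical cutoff: at or above this fraction of
    the repository, a bitmap is used; below it, a hash set of document IDs.
    """
    if len(doc_ids) / n_docs >= dense_threshold:
        bitmap = 0
        for doc_id in doc_ids:
            bitmap |= 1 << doc_id   # set the bit for this document
        return ("bitmap", bitmap)
    return ("hash", frozenset(doc_ids))


def count_in_result(cached, dq_ids):
    """Count the documents of Dq (a set of IDs) that carry the facet value."""
    kind, data = cached
    if kind == "bitmap":
        return sum(1 for doc_id in dq_ids if (data >> doc_id) & 1)
    return len(data & dq_ids)
```

The need to pick `dense_threshold` by hand is the kind of tuning decision that a single compressed representation avoids.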

Improving Solr:

Solr limitation 1: it has to choose a threshold that decides the representation of the bitset.

Improvement: represent a bitset as a compressed bitmap using Word-Aligned Hybrid (WAH) code.

WAH: There are 2 types of words.

Literal words: a verbatim representation of 31 bits.

Fill words: encode, in 30 bits, the length of a run of all 0’s or all 1’s.

A bitmap is first broken into groups of 31 bits and then converted into a sequence of literal and fill words.

Operations on bitmaps, such as intersection, can be performed on WAH code directly without decoding.
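A minimal sketch of WAH encoding and decoding follows, assuming the common 32-bit layout: the top bit marks a fill word, the next bit holds the fill value, and the low 30 bits count how many 31-bit groups the run covers. It pads the last group with zeros and does not implement intersection on the compressed form; the function names are hypothetical.

```python
GROUP = 31                      # payload bits per word
FILL_FLAG = 1 << 31             # top bit set: this is a fill word
FILL_VALUE = 1 << 30            # in a fill word: the repeated bit value
MAX_RUN = (1 << 30) - 1         # low 30 bits: run length in groups
ALL_ONES = (1 << GROUP) - 1


def wah_encode(bits):
    """Encode a list of 0/1 values into a list of 32-bit WAH words."""
    words = []
    for start in range(0, len(bits), GROUP):
        chunk = bits[start:start + GROUP]
        chunk = chunk + [0] * (GROUP - len(chunk))   # pad the final group
        value = 0
        for bit in chunk:
            value = (value << 1) | bit
        if value in (0, ALL_ONES):
            fill_bit = FILL_VALUE if value else 0
            last = words[-1] if words else 0
            same_fill = (last & FILL_FLAG) and (last & FILL_VALUE) == fill_bit
            if same_fill and (last & MAX_RUN) < MAX_RUN:
                words[-1] = last + 1                 # extend the current run
            else:
                words.append(FILL_FLAG | fill_bit | 1)
        else:
            words.append(value)                      # literal word
    return words


def wah_decode(words, n_bits):
    """Expand WAH words back into the first n_bits bits."""
    bits = []
    for word in words:
        if word & FILL_FLAG:
            bit = 1 if word & FILL_VALUE else 0
            bits.extend([bit] * (GROUP * (word & MAX_RUN)))
        else:
            bits.extend((word >> shift) & 1 for shift in range(GROUP - 1, -1, -1))
    return bits[:n_bits]
```

Long runs of identical bits collapse into a single fill word, so sparse and dense bitsets share one representation and no density threshold is needed.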

Improving Solr:

Solr limitation 2: it has to intersect the matching document set Dq with the bitset of every facet instance.

Improvement: reduce the number of intersections by building a directory structure, called a bitset tree, on top of the bitsets of a facet.

Building and Using a Bitset Tree:

Starting with the leaf nodes, for each bitset b corresponding to facet instance <F: f>, we create an entry <b, null>.

Then divide all entries into groups of size s.

For each group, we generate a leaf node holding all entries in that group.
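The slide only describes how the leaf level is built. The sketch below fills in the upper levels under an assumption: each parent entry stores the union (bitwise OR) of the bitsets in its child node, so a whole subtree can be skipped when Dq does not intersect that union. Bitsets are modeled as Python integers, and the `label` field and all names are hypothetical additions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    bits: int                       # bitset b, modeled as a Python integer
    child: Optional["Node"]         # None (null) for leaf entries
    label: Optional[str] = None     # facet value, kept only on leaf entries


@dataclass
class Node:
    entries: List[Entry]


def build_bitset_tree(facet_bitsets, s):
    """Build a tree over {facet value: bitset} with groups of size s."""
    if s < 2:
        raise ValueError("group size s must be at least 2")
    entries = [Entry(bits, None, label) for label, bits in facet_bitsets.items()]
    if not entries:
        return Node([])
    while True:
        nodes = [Node(entries[i:i + s]) for i in range(0, len(entries), s)]
        if len(nodes) == 1:
            return nodes[0]
        entries = []
        for node in nodes:
            union = 0
            for entry in node.entries:
                union |= entry.bits             # assumed: parent holds the union
            entries.append(Entry(union, node))


def facet_counts(root, dq):
    """Return ({facet value: count in Dq}, number of intersections performed)."""
    counts, intersections, stack = {}, 0, [root]
    while stack:
        node = stack.pop()
        for entry in node.entries:
            intersections += 1
            hit = entry.bits & dq
            if hit == 0:
                continue                        # prune: nothing below matches Dq
            if entry.child is None:
                counts[entry.label] = bin(hit).count("1")
            else:
                stack.append(entry.child)
    return counts, intersections
```

When Dq is small relative to the repository, most upper-level intersections come back empty, so far fewer than one intersection per facet instance is needed.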

Evaluation: Setup

DBLP: Contains about 13,000 papers published in 26 venues (e.g., SIGMOD, VLDB, TODS) in the past 30 years.

It has 14 facets organized in 6 hierarchies, including author, venue, time (e.g., decade, year), location (e.g., country, city), number of authors per paper, and number of citations per paper.

The title of each paper is used as the text for keyword searches.

Used to conduct the user survey.

Evaluation: Setup

Patent: Has about 1.8 million U.S. patents from the past 30 years.

16 facets organized into 10 hierarchies.

Used for performance evaluation.

Evaluation: Result from a User Survey

Performed tests on 3 keyword queries.

2 are provided by the authors: “distributed” and “mining”.

Users pick the third keyword.

1 based on natural; 2 based on navigational.

1 used the complete repository; 1 used the previous query.

Evaluation: Result from a User Survey

Our dynamic approach also received some negative feedback

Overall, the feedback for the natural expectation is neutral

Different ways of aggregating the degree of interestingness: HybridWeight (7) > MaxWeight (6) > AvgWeight (2)

Evaluation: Performance Results

Environment:

Implemented in Java.

3GHz P4 desktop machine with 1GB memory.

A single disk drive, running Linux.

Versions:

1. simple: inverted index
2. Solr
3. compressed: improves Solr with WAH code
4. tree: improves Solr with bitset trees
5. compressed-tree: both WAH and bitset tree on Solr

Scaling with Data Size:

Run a query that matches 25,000 docs using tree.

Break the total time into search time and summary computation time.

Conclusion and Future Work

Develop a novel dynamic faceted search system that supports OLAP-style discovery-driven analysis on a large set of structured and unstructured data.

Propose an intuitive and effective way of measuring “interestingness”.

Propose a novel navigational method of setting a user’s expectation.

Conclusion and Future Work

Incorporate user feedback in facet selection.

How to extend the aggregates to functions other than count, such as sum or average on some numerical measures.

How to support dynamic faceted search in a distributed environment.
