Building Optimal Websites with the Constrained Subtree Selection Problem
Brent Heeringa (joint work with Micah Adler)
09 November 2004
A website design problem (for example: a new kitchen store)
Given products, their popularity, and their organization, how do we create a good website?
  Navigation is natural
  Access to information is timely
[Figure: Knives organized by Type (paring, chef, bread, steak) and Maker (Wüstof, Henkels), with leaf popularities 0.26, 0.33, 0.27, 0.14]
Good website: Natural Navigation
Organization is a DAG
The transitive closure (TC) of the DAG enumerates all viable categorical relationships and introduces shortcuts
A subgraph of the TC preserves the logical relationships between categories
[Figure: a DAG, its transitive closure, and a subgraph of the TC]
Good website: Timely Access to Info
Two obstacles to finding info quickly:
  time scanning a page for the correct link
  time descending the DAG
Associate a cost with each obstacle:
  page cost (a function of the out-degree of the node)
  path cost (the sum of page costs on the path)
Good access structure: minimize the expected path cost
The optimal subgraph is always a full tree
Example: page cost = # links; path cost = 3 + 2 = 5; for a leaf of weight 1/2, the weighted path cost is 5/2
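This cost model can be sketched directly in Python; the tree encoding and node names below are hypothetical, not from the talk:

```python
def expected_path_cost(tree, weight, root):
    """Expected path cost of an access tree: the page cost of a node is
    its out-degree, the path cost of a leaf is the sum of page costs from
    the root down, and the expectation is taken over the leaf weights."""
    def walk(v, acc):
        kids = tree.get(v, [])
        if not kids:                       # leaf: weighted path cost
            return weight[v] * acc
        # every child pays this page's cost (= number of links here)
        return sum(walk(c, acc + len(kids)) for c in kids)
    return walk(root, 0)
```

In the test below, leaf 'd' (weight 1/2) sits below a degree-3 page and a degree-2 page, so its path cost is 3 + 2 = 5 and its weighted path cost is 5/2, as on the slide.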
Constrained Subtree Selection (CSS)
An instance of CSS is a triple (G, α, w):
  G is a rooted DAG with n leaves (the constraint graph)
  α is a function of the out-degree of each internal node (the degree cost)
  w is a probability distribution over the n leaves (the weights)
A solution is any directed subtree of the transitive closure of G which includes the root and the leaves
An optimal solution is one which minimizes the expected path cost
Example: α(x) = x; leaf weights 1/4, 1/4, 1/4, 1/4
Example continued: with α(x) = x and uniform weights 1/4, one optimal tree has leaf path costs 3, 5, 5, 3, so the expected path cost is 3(1/4) + 5(1/4) + 5(1/4) + 3(1/4) = 4
Example continued: with weights 1/2, 1/6, 1/6, 1/6 the optimal tree changes, moving the heavy leaf toward the root; with α(x) = x the expected path cost is 3 1/2
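The two example costs (4 for the uniform weights, 3 1/2 for the skewed ones) can be checked by brute force over all full trees on the leaf multiset; this exhaustive search is my own sketch, not the talk's algorithm:

```python
from fractions import Fraction  # keeps the example arithmetic exact

def opt_tree_cost(weights, alpha):
    """Exhaustive optimum for constraint-free CSS: try every way to split
    the leaf weights into d >= 2 groups under a root of out-degree d.
    cost(T) = sum over internal nodes of alpha(out-degree) * (mass below)."""
    if len(weights) == 1:
        return 0
    def partitions(items):  # all set partitions of a list of weights
        if len(items) == 1:
            yield [items]
            return
        first, rest = items[0], items[1:]
        for p in partitions(rest):
            for i in range(len(p)):
                yield p[:i] + [p[i] + [first]] + p[i + 1:]
            yield [[first]] + p
    best = float('inf')
    for p in partitions(list(weights)):
        if len(p) < 2:      # a degree-1 root never helps when alpha >= 1
            continue
        cost = alpha(len(p)) * sum(weights) + sum(opt_tree_cost(g, alpha) for g in p)
        best = min(best, cost)
    return best
```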
Constraint-Free Graphs and k-favorability
Constraint-free graph: every directed, full tree with n leaves is a subtree of the TC, so CSS is no longer constrained by the graph
k-favorable degree cost: fix α; there exists k > 1 such that for any constraint-free instance of CSS under α, some optimal tree has maximal out-degree k
  prefer binary structure when a leaf has at least half the mass (> 1/2)
  prefer ternary structure when mass is uniformly distributed
Linear Degree Cost: α(x) = x
CSS with a 2-favorable degree cost and a constraint-free graph is the Huffman coding problem
Examples of 2-favorable degree costs: quadratic, exponential, ceiling of log
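A sketch of the correspondence: for a 2-favorable degree cost the optimal tree is binary, so every page cost is α(2) and the expected path cost is α(2) times the weighted external path length of a Huffman tree. The quadratic cost α(x) = x² (so α(2) = 4) in the test is an assumed example:

```python
import heapq
from fractions import Fraction   # exact weights for the test below
from itertools import count

def css_cost_2favorable(weights, alpha2):
    """Optimal CSS cost on a constraint-free graph for a 2-favorable
    degree cost: alpha(2) times the weighted external path length of a
    Huffman tree over the leaf weights."""
    tie = count()                       # tie-breaker: never compare equal-weight payloads
    heap = [(w, next(tie)) for w in weights]
    heapq.heapify(heap)
    depth_sum = 0
    while len(heap) > 1:
        a, _ = heapq.heappop(heap)
        b, _ = heapq.heappop(heap)
        depth_sum += a + b              # merged mass descends one more level
        heapq.heappush(heap, (a + b, next(tie)))
    return alpha2 * depth_sum
```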
Results
Complexity: NP-complete for equal weights and many α; a sufficient condition on α; hardness depends on the constraint graph
Highlighted results:
Theorem: an O(n^(α(k)+k))-time DP algorithm when α is integer-valued and k-favorable and G is constraint-free (e.g., α(x) = x)
Theorem: a poly-time constant-factor approximation when α ≥ 1 and k-favorable and G has constant out-degree, via Approximate Hotlink Assignment [Kranakis et al.]
Other results: characterizations of optimal trees for uniform probability distributions
Related Work
Adaptive Websites [Perkowitz & Etzioni]: a challenge to the AI community; novel views of websites; the page synthesis problem
Hotlink Assignment [Kranakis, Krizanc, Shende, et al.]: add 1 hotlink per page to minimize the expected distance from root to leaves; recently, pages have cost proportional to their size; hotlinks don't change the page cost
Optimal Prefix-Free Codes [Golin & Rote]: a minimum-cost code for n words over r symbols where symbol ai has cost ci; resembles CSS without a constraint graph
Exact Cover by 3-Sets (X3C)
INPUT: (X, C), where X = (x1, …, xn) with n = 3k, and C = (C1, …, Cm) with each Ci ⊆ X and |Ci| = 3
QUESTION: Is there a C' ⊆ C with |C'| = k that covers X?
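A brute-force decider for tiny X3C instances (illustrative only; the problem is NP-complete, so this enumerates all subcollections):

```python
from itertools import combinations

def exact_cover_3sets(X, C):
    """Brute-force X3C: is there a subcollection of k = |X|/3 triples
    in C whose union covers every element of X exactly once?"""
    k = len(X) // 3
    for pick in combinations(C, k):
        chosen = [x for S in pick for x in S]
        # an exact cover uses each element once: no repeats, full coverage
        if len(chosen) == len(X) and set(chosen) == set(X):
            return True
    return False
```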
Sufficient condition on α: for every integer k, there exists an integer s(k) such that …
Lopsided Trees
Recall: α(x) = x and G is constraint-free
Node level = path cost
Adding an edge increases the level
Grow lopsided trees level by level
Lopsided Trees
We know the exact cost of the tree up to the current level i:
  the exact cost of the m leaves fixed at or above level i
  the remaining n-m leaves must have path cost at least i
Lopsided Trees
Exact cost of C: 3 · (1/3) = 1
Remaining mass up to level 4: (2/3) · 4 = 8/3
Total: 1 + 8/3 = 11/3
Lopsided Trees
Tree cost at level 5 in terms of tree cost at level 4: add in the mass of the remaining leaves
Cost at level 5 (no new leaves): 11/3 + 2/3 = 13/3
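The level-by-level accounting can be sketched as below. The leaf placements (mass 1/3 fixed at level 3 and, inferred from the level-6 figure's 29/6, mass 1/6 fixed at level 5) are read off the slides' example and are assumptions of this sketch:

```python
from fractions import Fraction

def level_costs(leaf_mass_by_level, total_mass, upto):
    """Running cost of a lopsided tree grown level by level: a leaf fixed
    at level L contributes exactly L * mass, and every unfixed leaf's mass
    descends one more level each step (the slides' accounting)."""
    cost = Fraction(0)
    remaining = Fraction(total_mass)
    costs = []
    for level in range(1, upto + 1):
        cost += remaining                    # unfixed mass descends one level
        for mass in leaf_mass_by_level.get(level, []):
            remaining -= Fraction(mass)      # leaf fixed here: fully paid
        costs.append(cost)
    return costs
```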
Lopsided Trees
Equality on trees:
  equal number of leaves at or above the frontier
  equal number of nodes at each relative level below the frontier
Nodes have out-degree ≤ 3, so a node sits at most α(3) = 3 levels below the frontier
(m; l1, l2, l3) = signature
Example signature: (2; 3, 2, 0)
  2: C and F are leaves
  3: G, H, I are 1 level past the frontier
  2: J and K are 2 levels past the frontier
Inductive Definition
Let CSS(m, l1, l2, l3) = the minimum cost of a tree with signature (m; l1, l2, l3)
Can we define CSS(m, l1, l2, l3) in terms of optimal substructures?
Which trees, when grown by one level, have signature (m; l1, l2, l3)?
Which signatures (m'; l'1, l'2, l'3) lead to (m; l1, l2, l3)?
Example signatures: (0; 2, 0, 0) and (1; 0, 0, 3)
Growing a tree only affects the frontier, and only l1 affects the next level:
  choose which nodes become leaves
  the remaining nodes are internal: choose how many have degree 2 (d2); the rest have degree 3 (d3)
O(n^2) choices
The other direction
The original question (warning: here be symbols)
Which (m'; l'1, l'2, l'3) lead to (m; l1, l2, l3)? Knowing l'1 and d2 is sufficient, and both are O(n), so there are O(n^2) possibilities for (m'; l'1, l'2, l'3)
CSS(m, l1, l2, l3) = min over 1 ≤ d2 ≤ l'1 ≤ n of CSS(m', l'1, l'2, l'3) + cm'
  (cm' is the total mass of the n-m' smallest weights: the remaining leaves each descend one more level)
CSS(n, 0, 0, 0) = cost of the optimal tree
Analysis: table size O(n^4); each cell takes O(n^2) lookups; an O(n^6) algorithm
Lower Bound on Cost
Lemma: H(w)/log(k) is a lower bound on the cost of an optimal tree, for any k-favorable degree cost α with α ≥ 1, when G is constraint-free
c(T) ≥ c'(T) ≥ c'(T') ≥ H(w)/log(k)   (Shannon)
[Figure: the optimal CSS tree T and the optimal prefix-code tree T', with every edge given unit cost under c' (α(x) = 1)]
A Simple Lemma
Lemma 2: for any tree with m weighted nodes there exists one node (the splitter) which, when removed, divides the tree into subtrees, each with at most half the weight of the original tree
[Figure: a splitter node; each resulting piece has weight < 1/2]
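The splitter of Lemma 2 can be found with a standard centroid-style walk: descend into any child whose subtree carries more than half the total weight. A minimal sketch (the tree encoding is hypothetical):

```python
def find_splitter(tree, weight, root):
    """Lemma 2 sketch: return a node whose removal splits the tree into
    pieces of at most half the total weight.  tree maps a node to its
    children; weight maps a node to its own weight (default 0)."""
    sub = {}
    def fill(v):  # subtree weights, computed bottom-up
        sub[v] = weight.get(v, 0) + sum(fill(c) for c in tree.get(v, []))
        return sub[v]
    half = fill(root) / 2
    v = root
    while True:
        heavy = [c for c in tree.get(v, []) if sub[c] > half]
        if not heavy:
            return v  # all children are <= half, and the rest is < half
        v = heavy[0]  # at most one child can exceed half the weight
```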
Approximation Algorithm
Let G be a DAG where the out-degree of every node is at most d
Choose a spanning tree T from G
Balance-Tree(T):
  find a splitter node in T (Lemma 2)
  stop if the splitter is a child of the root
  disconnect the splitter and reconnect it to the root (the root then has degree at most d+1)
  call Balance-Tree on all subtrees
[Figure: the promoted splitter; the mass of each resulting subtree is at most half that of the whole tree]
Approximation Algorithm
Analysis: the mass under any node is at most half the mass under its grandparent, so the path length to a leaf with weight wi is at most -2·log(wi)
Theorem: an O(m)-time, O(log(k)·α(d+1))-approximation to the optimal solution, for any DAG G with m nodes and out-degree d and every k-favorable degree cost α ≥ 1
(The upper bound multiplies the node cost by the weighted path length; compare against the entropy lower bound.)
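The Balance-Tree steps above can be sketched as follows (same hypothetical adjacency encoding as in the Lemma 2 sketch; subtree weights are recomputed naively, so this is illustrative rather than the O(m)-time version):

```python
def balance_tree(tree, weight, root):
    """Balance-Tree sketch: find a splitter (Lemma 2); unless it is the
    root or already a child of the root, cut it from its parent and make
    it a child of the root (degree grows by at most one); then recurse
    on the root's subtrees."""
    def subw(v):
        return weight.get(v, 0) + sum(subw(c) for c in tree.get(v, []))
    half = subw(root) / 2
    s = root
    while True:                          # centroid walk from Lemma 2
        heavy = [c for c in tree.get(s, []) if subw(c) > half]
        if not heavy:
            break
        s = heavy[0]
    if s != root and s not in tree.get(root, []):
        for ch in tree.values():         # disconnect s from its parent
            if s in ch:
                ch.remove(s)
                break
        tree.setdefault(root, []).append(s)  # reconnect beneath the root
    for c in list(tree.get(root, [])):
        balance_tree(tree, weight, c)
    return tree
```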
Proposed Problem 1 (CSS in constraint-free graphs, equal leaf weights)
Question: is there a poly-time algorithm for CSS with constraint-free graphs, equal leaf weights, and an increasing degree cost?
Good news:
  characterizations for linear and log degree costs
  near-linear-time algorithms for r-ary Varn codes (Huffman codes with r unequal letter costs and a uniform probability distribution)
Varn Codes (infinite lopsided tree)
Symbol costs = (3, 3, 3, 8, 8)
5 leaves: note that the optimal code is not the 5 highest (cheapest) leaves of the infinite tree!
6 leaves: note that the m internal nodes of the optimal code are the highest m nodes in the infinite tree
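A sketch of the infinite-tree view: lazily grow the m cheapest nodes as internal, take the n cheapest remaining children as codewords, and try each m. This is my reading of the slide's claim, not code from the talk:

```python
import heapq

def varn_code_cost(n, costs):
    """Total cost of a Varn code (uniform weights) with n codewords over
    the given letter costs: for each candidate count m of internal nodes,
    make the m cheapest nodes of the infinite tree internal and take the
    n cheapest remaining children as codewords."""
    best = None
    for m in range(1, n):                # chains never help, so m < n
        frontier = [0]                   # root of the infinite tree
        for _ in range(m):
            v = heapq.heappop(frontier)  # cheapest node becomes internal
            for c in costs:
                heapq.heappush(frontier, v + c)
        leaves = sorted(frontier)[:n]    # cheapest available codewords
        if len(leaves) < n:
            continue                     # not enough children yet
        total = sum(leaves)
        if best is None or total < best:
            best = total
    return best
```

With costs (3, 3, 3, 8, 8) and 5 codewords this picks {3, 3, 6, 6, 6} (total 24) rather than the 5 cheapest children of the root {3, 3, 3, 8, 8} (total 25), matching the "not the 5 highest leaves" note.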
Proposed Problem 1 (CSS in constraint-free graphs, equal leaf weights)
Bad news: there is no notion of an infinite lopsided tree in CSS; a degree change is a structure change
Property: the optimal CSS tree is fairly balanced; no leaf may appear above the level of any other internal node
Proof: if it were the case, we could switch branches and decrease the cost of the tree
Intuition: there is some k which optimizes the breadth-to-depth tradeoff; the optimal tree repeats this structure; the fringe requires some computation time
Proposed Problem 2 (Dynamic CSS)
CSS often applies to environments which are inherently dynamic: web pages change popularity; access patterns change on file systems
Question: given a CSS tree with property P, how much time does it take to maintain P after an update?
P = minimum cost, or an approximation ratio of the minimum cost
Restrict attention to integer leaf weights (rational distributions) and unit updates
Proposed Problem 2 (Dynamic CSS)
Good news: Knuth (and later Vitter) studied Dynamic Huffman Codes (DHC)
Motivation: one-pass encoding
Protocol: both parties maintain an optimal tree for the first t characters; encode and decode the (t+1)st character; update the tree
Optimality of the tree is maintained in time proportional to the encoding
DHC: Sibling Property
A binary tree with n leaves is a Huffman tree iff:
  the n leaves have nonnegative weights w1, …, wn
  the weight of each internal node is the sum of the weights of its children
  the nodes can be numbered in nondecreasing order by weight such that siblings are numbered consecutively and their common parent has a higher number
[Figure: a Huffman tree on leaves A-F, nodes numbered 1-11 in nondecreasing weight order]
The numbering corresponds to the merging order in the greedy algorithm
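The sibling property can be checked mechanically. In the sketch below, nodes are numbered 1..2n-1 as on the slides; the concrete tree in the test (the Huffman tree for weights 1, 2, 3, 4, numbered 1 through 7) is a hypothetical stand-in for the A-F example:

```python
def sibling_property(weights, parent):
    """Check that the numbering 1..N witnesses the sibling property:
    weights nondecreasing, nodes 2i-1 and 2i are siblings whose common
    parent has a higher number, and each internal weight is the sum of
    its children.  Index 0 is unused; parent[root] = 0."""
    N = len(weights) - 1
    if any(weights[i] > weights[i + 1] for i in range(1, N)):
        return False                     # weights must be nondecreasing
    kids = {}
    for i in range(1, N):                # the root (node N) has no parent
        kids.setdefault(parent[i], []).append(i)
    for p, ch in kids.items():
        if weights[p] != sum(weights[c] for c in ch):
            return False                 # internal weight = sum of children
    for i in range(1, N, 2):
        if parent[i] != parent[i + 1] or parent[i] <= i + 1:
            return False                 # siblings consecutive, parent higher
    return True
```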
[Figure: the same tree after incrementing B's weight]
What happens if we increase B? Node 4 violates the sibling property
Before updating: exchange the current node with the highest-numbered node having the same weight
[Figure: the exchange step]
A different, but still optimal, greedy choice when merging nodes
[Figure: the tree after the exchange, renumbered]
Now it is safe to increase B, because it cannot become greater than the next-highest node
[Figure: the tree after B's weight is incremented]
Proposed Problem 2 (Dynamic CSS)
Good news: DHC generalizes to k-ary alphabets
Claim: DHC is an O(α(k))-approximation for CSS with a k-favorable degree cost α, α(x) ≥ 1, and constraint-free graphs
Proposed Problem 2 (Dynamic CSS)
Bad news: DHC doesn't generalize to Huffman codes with unequal letter costs (the sibling property encodes the greedy algorithm)
Future work: explore DHC for unequal letter costs; maintain the approximation ratio in constant-degree graphs in time proportional to the height (we can already do it in linear time)
Proposed Problem 3 (Category Tree - CT)
Scenario: a large reservoir of songs in iTunes; a song is a vector of categorical values; it is common to search all the songs for the right one
Question: can we organize the songs by categories so that the average search time is minimized?
Proposed Problem 3 (Category Tree - CT)
Category Tree: CT(α, C, S), where α is the degree cost, C = (d1, …, dm) are the m category sizes, and S is a set of objects drawn from C
Solution: a rooted, oriented tree; internal nodes are categories; edges are the appropriate categorical values; leaves are objects
Optimal solution: minimize the expected path cost (path cost is defined as in CSS)
An optimal solution corresponds to an adaptive ordering of the categories
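A brute-force sketch of the adaptive ordering: at every node try each remaining category and keep the cheapest split, so different branches may split on different categories. The instance, encoding, and uniform object weights are hypothetical:

```python
def ct_cost(objects, categories, alpha):
    """Brute-force optimal Category Tree cost for a tiny instance.
    objects: list of tuples of categorical values; categories: indices of
    categories still available; alpha: degree cost.  Every object passing
    through a node pays alpha(out-degree); out-degree is the number of
    category values present among the node's objects."""
    if len(objects) <= 1 or not categories:
        return 0.0
    n = len(objects)
    best = float('inf')
    for c in categories:
        groups = {}
        for o in objects:
            groups.setdefault(o[c], []).append(o)
        rest = [x for x in categories if x != c]
        cost = alpha(len(groups)) + sum(len(g) / n * ct_cost(g, rest, alpha)
                                        for g in groups.values())
        best = min(best, cost)
    return best
```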
Proposed Problem 3 (Constrained Category Tree - CCT)
Constrained Category Tree: CCT(α, C, S), where α is the degree cost, C = (d1, …, dm) are the m category sizes, and S is a set of objects drawn from C
Solution: a rooted, oriented tree; internal nodes are categories (and internal nodes at the same depth have the same category); edges are the appropriate categorical values; leaves are objects
Optimal solution: minimize the expected path cost (path cost is defined as in CSS)
An optimal solution corresponds to a fixed ordering of the categories
Proposed Problem 3 (Category Tree - CT)
CT and CCT are classical decision tree problems
Decision Tree (DT):
  Input: m binary tests T = (T1, …, Tm) and n objects O = (O1, …, On)
  Output: a binary tree whose internal nodes are tests Ti and whose leaves are objects Oi
  Measure: total external path length
CT and CCT are NP-complete, by reduction from Exact Cover by 3-Sets (X3C); the proof resembles the hardness proof for Decision Tree
Proposed Problem 3 (Category Tree - CT)
Decision Tree Inference (DTI):
  Input: m examples, true/false-labeled binary strings from {0,1}^n
  Output: a binary tree, consistent with the examples, whose internal nodes are string positions and whose leaves are TRUE or FALSE
  Measure: number of leaves (i.e., the size of the tree)
CT and CCT are not instances of DTI; DT doesn't easily reduce to DTI; most complexity results (lower bounds on approximations) are for DTI only!
Open Problems
Theorem: H(w)/log(k) is a lower bound on the cost of an optimal tree for any instance (G, α, w) of CSS where G is constraint-free and α is k-favorable, maps the positive integers to the positive integers, and is non-decreasing
Proof: c(T) ≥ c'(T) ≥ c'(T') ≥ H(w)/log(k), where T is the optimal tree for the CSS cost c, T' is the optimal tree for the OPC cost c' with k symbols each of cost 1 (i.e., α(x) = 1), and H is the entropy
Signatures as Representation
Different lopsided trees share common substructure when truncated
Level-i truncation: include a node iff its parent is at level at most i
Level-i signature: [m; l1, …, l_α(k)], where m is the number of leaves at level ≤ i and lj is the number of nodes at level i+j
Cost of a level-i truncation: the exact cost of the m leaves, plus the cost up to the truncation for the remaining n-m leaves
The Dynamic Programming Table
Signatures = table entries: MIN[m; l1, …, l_α(k)] gives the minimum cost over all truncated trees with signature [m; l1, …, l_α(k)]
O(n^(α(k)+1)) entries; a level-i truncation is the parent of O(n^(k-1)) level-(i+1) truncations, so a level-i signature is the parent of O(n^(k-1)) level-(i+1) signatures:
  choose how many nodes at the next level will be internal
  among those, choose how many will be degree 2, degree 3, …, degree k: O(n^(k-1)) choices
With a consistent ordering of the entries this gives an O(n^(α(k)+k)) algorithm; MIN[n; 0, …, 0] contains the minimum cost
Set of products = the desired information (e.g., chef & paring knives)
Popularity of products = weights
Hierarchical organization of products into categories: a single, global category (the root); products are endpoints (leaves); a general-to-specific trajectory
Adaptive Websites [Perkowitz & Etzioni]: page synthesis (a novel view) with clustering and concept learning over access logs; efficiently find the topic of interest (effort)
Hotlink Assignment [Kranakis, Krizanc, Shende, et al.]: add k hotlinks per page to minimize the expected distance from root to leaves; recently, pages have a fixed cost proportional to their size; hotlinks don't change the path cost
Optimal Prefix-Free Codes [Golin & Rote]: a minimum-cost code for n words over r symbols where symbol ai has cost ci; resembles CSS without a constraint graph
Lopsided Trees
MIN[m; l1, …, l_α(k)]: n leaves, so at most O(n^(α(k)+1)) entries; each entry stores the minimum cost of a tree bearing that signature
A total ordering on signatures, consistent with the growing process
O(n^(k-1)) choices per entry: an O(n^(α(k)+k)) algorithm
Lopsided Trees
Tree cost at level 5 in terms of tree cost at level 4:
Cost at level 5: 11/3 + 2/3 = 13/3
Cost at level 6: 13/3 + 1/2 = 29/6
The original question (warning: here be symbols)
Which (m'; l'1, l'2, l'3) lead to (m; l1, l2, l3)? Suppose we know:
  l'1 (the number of nodes one level below the frontier)
  d2 (the number of those l'1 nodes which are degree-2 nodes in (m; l1, l2, l3))
Let's determine the values of the remaining variables.
[Figure: the l'1 nodes one level below the frontier, d2 of which become degree-2 internal nodes]
m = m' + l'1 - d2 - d3, where m is the new number of leaves, m' the old number of leaves, l'1 the nodes one level below the frontier, d2 the internal nodes of degree 2, and d3 the internal nodes of degree 3
The 3 children of each degree-3 node land three levels below the new frontier, so l3 = 3·d3; substituting d3 = l3/3 gives m = m' + l'1 - d2 - l3/3
l'2 = l1: the old number of nodes two levels below the frontier equals the new number of nodes one level below it
l2 = l'3 + 2·d2: the new nodes two levels below the frontier are the old nodes three levels below it, plus the children of the binary nodes (each of the d2 nodes contributes 2)
Organized Data
Premise: people organize data so it is easy to find; navigation is natural; popular items are easily accessible
Observation: most existing data could be better organized; files clutter folders; directory structures lose consistency; web pages are buried deep in the website; searching takes too much time
Question: how can we automatically improve access to organized information?
Thesis goals: models for information organization tasks; a novel deliberation cost; computational complexity; algorithms and approximations