SUPERVISED PREDICTION OF GRAPH SUMMARIES
Daniil Mirylenka University of Trento, Italy
1
Outline
• Motivating example (from my Ph.D. research)
• Supervised learning
  • Binary classification
  • Perceptron
  • Ranking, multiclass
• Structured output prediction
  • General approach, structured Perceptron
  • “Easy” cases
• Prediction as search
  • Searn, DAGGER
• Back to the motivating example
2
Motivating example
Representing academic search results
[figure: search results and their graph summary]
3
Motivating example
Suppose we can do this:
[figure: a large graph reduced to a graph summary]
4
Motivating example
Then we only need to do this:
[figure: building a small graph]
5
Motivating example
What is a good graph summary?
Let’s learn from examples!
6
Supervised learning
7
What is supervised learning?
Given a bunch of examples  (x1, y1), (x2, y2), …, (xn, yn),
learn a function  f : x → y
8–9
Statistical learning theory
Where do our examples come from?
From a distribution of examples: the (xi, yi) are samples drawn i.i.d. from P(x, y).
10
Statistical learning theory
What functions do we consider?
A hypothesis space:  f ∈ H
• linear? (H1)
• cubic? (H2)
• piecewise-linear? (H3)
11
Statistical learning theory
How bad is it to predict f(x) instead of the true y?
A loss function:  L(f(x), y)
Example: zero-one loss
  L(y, y′) = 0 when y = y′, 1 otherwise
Statistical learning theory
Goal: minimize the expected loss on new examples
  argmin_{f∈H} ∫_{X×Y} L(f(x), y) p(x, y) dx dy
Requirement: minimize the total loss on the training data
  argmin_{f∈H} Σ_{i=1…n} L(f(xi), yi)
12
Linear models
Inference (prediction):
  fw(x) = g(⟨w, φ(x)⟩)
  where φ(x) are the features of x, and ⟨·, ·⟩ is the scalar product (a linear combination, or weighted sum)
Learning:
  w = argmin_w Σ_{i=1…n} L(fw(xi), yi)
  (optimization with respect to w, e.g. gradient descent)
13
Binary classification
y ∈ {−1, 1}
Prediction:  fw(x) = sign(⟨w, x⟩)
(above or below the line, i.e. the hyperplane?)
  ⟨w, x⟩ > 0
  ⟨w, x⟩ < 0
  ⟨w, x⟩ = 0
Note that the prediction is correct exactly when  yi⟨w, xi⟩ > 0.
14
Perceptron
Learning algorithm (optimizes one example at a time):
Repeat:
  for every xi:
    if yi⟨w, xi⟩ ≤ 0   (if we made a mistake)
      w ← w + yi·xi   (update the weights)
If yi > 0, the update makes the new w more like xi; if yi < 0, more like −xi.
15
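The recap at the end claims this fits in four lines of code, and the core loop really is that small. A minimal NumPy sketch; the toy data and variable names here are mine, not from the talk:

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Train a linear classifier with the perceptron update rule.

    X: (n, d) array of feature vectors; y: (n,) array of labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):               # Repeat
        for xi, yi in zip(X, y):          # for every xi
            if yi * np.dot(w, xi) <= 0:   # if we made a mistake
                w = w + yi * xi           # update the weights
    return w

# Toy usage: points labeled by the sign of their first coordinate
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```

On linearly separable data like this the loop stops updating once every example satisfies yi⟨w, xi⟩ > 0.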
Perceptron
Update rule:  w ← w + yi·xi
[figure: the old weight vector w_old is nudged by yi·xi toward xi, giving w_new]
16–17
Max-margin classification
Idea: ensure some distance from the hyperplane
Require:  yi⟨w, xi⟩ ≥ 1
18
Preference learning
Suppose we want to predict rankings:  x → y = (v1, v2, …, vk)
  (x, vi) ≻ (x, vj)  ⇔  i < j
Using joint features φ(x, v) of x and v, require a margin between the items:
  ⟨w, φ(x, v) − φ(x, v′)⟩ ≥ 1  whenever (x, v) ≻ (x, v′)
Also works for:
• selecting just the best one
• multiclass classification
19
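One simple way to learn such a model is a perceptron on preference pairs: whenever a pair is mis-ordered, update w with the feature difference. This is a sketch of that perceptron-style variant (the margin constraint on the slide would be enforced SVM-style instead); the 1-D toy features are invented:

```python
import numpy as np

def ranking_perceptron(pairs, epochs=10):
    """Perceptron on preference pairs.

    pairs: list of (phi_better, phi_worse) joint-feature vectors,
    where the first item should be ranked above the second.
    """
    w = np.zeros(len(pairs[0][0]))
    for _ in range(epochs):
        for better, worse in pairs:
            diff = np.asarray(better) - np.asarray(worse)
            if np.dot(w, diff) <= 0:   # pair is mis-ordered: update
                w = w + diff
    return w

# Toy usage on invented 1-D joint features
pairs = [([3.0], [1.0]), ([2.0], [0.5])]
w = ranking_perceptron(pairs)
```

After training, w scores every preferred item above its alternative, so sorting by ⟨w, φ(x, v)⟩ reproduces the rankings.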
Structured prediction
20
Structured prediction
Examples:
• part-of-speech tagging
  x = “Time flies like an arrow.”
  y = (noun, verb, preposition, determiner, noun)
• or a parse tree
  y = (S (NP (NNP Time)) (VP (VBZ flies) (PP (IN like) (NP (DT an) (NN arrow)))))
21
Structured prediction
How can we approach this problem?
• before, we had:  f(x) = g(⟨w, φ(x)⟩)
• now f(x) must be a complex object:
  f(x) = argmax_y ⟨w, ψ(x, y)⟩
  with joint features ψ(x, y) of x and y (much like in ranking)
22
Structured Perceptron
Almost the same as the ordinary perceptron:
• for every xi:
  • predict:  ŷi = argmax_y ⟨w, ψ(xi, y)⟩
  • if ŷi ≠ yi  (if we made a mistake)
    • update the weights:  w ← w + ψ(xi, yi) − ψ(xi, ŷi)
23
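A sketch of the structured Perceptron with a brute-force argmax over all outputs, which is feasible only for tiny problems (the next slides explain why). The word/tag data and the count-based joint features ψ are invented for illustration:

```python
from itertools import product
from collections import Counter

def psi(x, y):
    """Toy joint features psi(x, y): counts of (word, tag) pairs."""
    return Counter(zip(x, y))

def predict(w, x, tags):
    """Brute-force argmax over all tag sequences: len(tags)**len(x) candidates."""
    return max(product(tags, repeat=len(x)),
               key=lambda y: sum(w.get(f, 0.0) * c for f, c in psi(x, y).items()))

def structured_perceptron(data, tags, epochs=5):
    w = {}
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = predict(w, x, tags)
            if y_hat != tuple(y_true):               # made a mistake
                for f, c in psi(x, y_true).items():  # w += psi(x, y_true)
                    w[f] = w.get(f, 0.0) + c
                for f, c in psi(x, y_hat).items():   # w -= psi(x, y_hat)
                    w[f] = w.get(f, 0.0) - c
    return w

# Toy usage on two invented tagged sentences
data = [(("time", "flies"), ("noun", "verb")),
        (("the", "arrow"), ("det", "noun"))]
w = structured_perceptron(data, tags=("noun", "verb", "det"))
```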
Argmax problem
Prediction:  ŷ = argmax_y ⟨w, ψ(x, y)⟩  (often infeasible)
Examples:
• a sequence of length T with d options for each label: d^T outputs
• a subgraph of size T from a graph G: |G| choose T outputs
• a 10-word sentence with 5 parts of speech: ~10 million outputs
• a 10-node subgraph of a 300-node graph:
  1,398,320,233,231,701,770 outputs (around 10^18)
24
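These counts are easy to sanity-check:

```python
import math

# 10-word sentence, 5 parts of speech: one tag per word
sequences = 5 ** 10              # ~10 million tag sequences

# 10-node subgraph of a 300-node graph: choosing 10 of the 300 vertices
subgraphs = math.comb(300, 10)   # on the order of 10**18 subsets

print(sequences)
print(subgraphs)
```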
Argmax problem
Prediction:  ŷ = argmax_y ⟨w, ψ(x, y)⟩  (often infeasible)
Learning:
• even more difficult
• includes prediction as a subroutine
25
Argmax problem: easy cases
Independent prediction
• suppose y decomposes into (v1, v2, …, vT)
• and ψ(x, y) decomposes into  ψ(x, y) = Σ_{i=1…T} ψi(x, vi)
• then predictions can be made independently:
  argmax_y ⟨w, ψ(x, y)⟩ = (argmax_{v1} ⟨w, ψ1(x, v1)⟩, …, argmax_{vT} ⟨w, ψT(x, vT)⟩)
26
Argmax problem: easy cases
Sequence labeling
• suppose y decomposes into (v1, v2, …, vT)
• and ψ(x, y) decomposes into  ψ(x, y) = Σ_{i=1…T−1} ψi(x, vi, vi+1)
• dynamic programming: O(Td²)
• with ternary features: O(Td³), etc.
• in general, tractable in graphs with bounded treewidth
27
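The O(Td²) dynamic program (the Viterbi algorithm) can be sketched as follows; the pairwise score function is a stand-in for ⟨w, ψi(x, vi, vi+1)⟩:

```python
def viterbi(T, labels, score):
    """Max-scoring label sequence when psi decomposes over adjacent pairs.

    score(t, a, b): contribution of label a at position t, b at t+1.
    Runs in O(T * d**2) time for d = len(labels).
    """
    # best[v] = score of the best length-(t+1) sequence ending in label v
    best = {v: 0.0 for v in labels}
    back = []
    for t in range(T - 1):
        new_best, ptr = {}, {}
        for v in labels:
            # best previous label u for the current label v
            u = max(labels, key=lambda u: best[u] + score(t, u, v))
            new_best[v] = best[u] + score(t, u, v)
            ptr[v] = u
        best, back = new_best, back + [ptr]
    # reconstruct the argmax sequence by following the back-pointers
    seq = [max(labels, key=lambda v: best[v])]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

# Toy usage: rewarding adjacent labels that differ makes the argmax alternate
path = viterbi(4, ["A", "B"], lambda t, a, b: 1.0 if a != b else 0.0)
```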
Approximate argmax
General idea: search in the space of outputs.
Natural generalization:
• search in the space of partial outputs
• compose the solution sequentially
How do we decide which moves to take? Let’s learn to make good moves!
(The most interesting/crazy idea of this talk: we don’t need the original argmax problem anymore.)
28
Learning to search
29
Learning to search
Sequential prediction of structured outputs
• decompose the output:  y = (v1, v2, …, vT)
• learn the policy  π : (v1, v2, …, vt) → vt+1  (from state st to action vt+1)
• apply the policy sequentially:  s0 →π s1 →π … →π sT = y
• the policy can be trained on examples (st, vt+1), e.g. by preference learning
30
Learning to search
The caveat of sequential prediction
States si: the coordinates of the car
Actions vi+1: steering (‘left’, ‘right’)
[figure: a car steered left/right drifts off the road (“Oops!”)]
Problem:
• errors accumulate
• the training data is not i.i.d.!
Solution:
• train on the states produced by our policy!
• a chicken-and-egg problem (solution: iterate)
31
Searn and DAGGER
Searn = “search” + “learn” [1]
• start from the optimal policy; gradually move away from it
• generate new states with the current policy πi
• generate actions based on regret
• train a policy π′i+1 on the new state–action pairs
• interpolate it with the current policy:
  πi+1 ← β·πi + (1 − β)·π′i+1
  (π′i+1 is the policy learnt at the i-th iteration)
32
[1] Hal Daumé III, John Langford, Daniel Marcu. Search-based Structured Prediction. Machine Learning Journal, 2006.
Searn and DAGGER
DAGGER = “dataset” + “aggregation” [2]
• start from the ‘ground truth’ dataset and enrich it with new state–action pairs:
  • train a policy on the current dataset
  • use the policy to generate new states
  • generate ‘expert’s actions’ for the new states
  • add the new state–action pairs to the dataset
As in Searn, we’ll eventually be training on the states produced by our own policy.
33
[2] Stephane Ross, Geoffrey Gordon, Drew Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Journal of Machine Learning Research, 2011.
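The DAGGER loop above can be sketched generically; `expert`, `train`, and `rollout` are placeholders, and the 1-D “car” below (a majority-vote policy, an expert steering toward 0) is a made-up toy, not the paper’s setup:

```python
from collections import Counter, defaultdict

def dagger(dataset, expert, train, rollout, iterations=5):
    """Schematic DAGGER loop: train, roll out, relabel, aggregate.

    dataset: initial list of (state, expert_action) pairs from the ground truth.
    expert:  maps a state to the 'expert action' for that state.
    train:   maps a dataset to a policy (a function state -> action).
    rollout: maps a policy to the list of states it visits.
    """
    data = list(dataset)
    policy = None
    for _ in range(iterations):
        policy = train(data)             # train on the current dataset
        for s in rollout(policy):        # generate new states with the policy
            data.append((s, expert(s)))  # label them with expert actions
    return policy

# Toy usage: state = 1-D position, actions are -1/+1, expert steers toward 0
def expert(s):
    return -1 if s > 0 else 1

def train(data):
    votes = defaultdict(Counter)         # majority vote per seen state
    for s, a in data:
        votes[s][a] += 1
    return lambda s: votes[s].most_common(1)[0][0] if s in votes else 1

def rollout(policy, start=3, steps=5):
    states, s = [], start
    for _ in range(steps):
        states.append(s)
        s += policy(s)
    return states

policy = dagger([(3, -1), (2, -1)], expert, train, rollout)
```

The initial dataset only covers states 3 and 2; the aggregated dataset grows to cover the states the learner itself visits, which is the whole point of DAGGER.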
DAGGER for building the graph summaries
Input: topic graph G(V, E), search results S, topic–document relation R ⊆ V × S
Output: topic summary G_T(V_T, E_T) of size T
A few tricks:
• predict only the vertices V_T
• require that the summaries be nested:  ∅ = V0 ⊂ V1 ⊂ … ⊂ VT,
  which means Vi+1 = Vi + vi+1
• hence, the task is to predict the sequence (v1, v2, …, vT)
34
DAGGER for building the graph summaries
• Provide the ‘ground truth’ topic sequences. A single ground-truth example is
  ((V, S, R), (v1, v2, …, vT))
  with topics (vertices) V, documents (search results) S, topic–document relations R, and the topic sequence (v1, …, vT)
• Create the initial dataset  D0 = ∪ {(si, vi+1)}, i = 0…T
• train the policy πi on Di
• apply πi to the initial states s0 to generate state sequences (s1, s2, …, sT)
  (from the empty summary through the intermediate summaries)
• produce the ‘expert action’ v* for every generated state
• produce  Di+1 = Di ∪ {(s, v*)}
35
DAGGER: producing the ‘expert action’
• The expert’s action brings us closer to the ‘ground-truth’ trajectory
• Suppose the ‘ground-truth’ trajectory is  (s̄1, s̄2, …, s̄T)
• and the generated trajectory is  (s1, s2, …, sT)
• The expert’s action is
  v*i+1 = argmin_v Δ(si ∪ {v}, s̄i+1)
  where Δ is a dissimilarity between the states
36
DAGGER: topic sequence dissimilarity
Δ((v1, v2, …, vt), (v′1, v′2, …, v′t))
• Set-based dissimilarity, e.g. Jaccard distance
  • similarity between topics?
  • encourages redundancy
• Sequence-matching-based dissimilarity
  • greedy approximation
37
DAGGER: topic graph features
ψ((V, S, R), (v1, v2, …, vt))
• Coverage and diversity:
  • [transitive] document coverage
  • [transitive] topic frequency, average and min
  • topic overlap, average and max
  • parent–child overlap, average and max
  • …
38
Recap
We’ve learnt:
• … how to do binary classification, and implement it in 4 lines of code
• … about more complex problems (ranking and structured prediction)
  • the general approach, the structured Perceptron, the argmax problem
• … that learning and search are two sides of the same coin
• … how to predict complex structures by building them sequentially
  • Searn and DAGGER
39
Questions?
40
dmirylenka@disi.unitn.it
Extra slides
41
Support Vector Machine
Idea: a large margin between the positive and negative examples
Loss function: hinge loss
  L(y, f(x)) = [1 − y·f(x)]+
Equivalent formulations (solved by constrained convex optimization):
  { yi⟨w, xi⟩ ≥ C, C → max }  ⇔  { yi⟨w, xi⟩ ≥ 1, ‖w‖ → min }
42
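The hinge loss and a single subgradient step on it, as a minimal sketch (no regularization term, so this is not a complete SVM solver; the toy example is mine):

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss [1 - y * <w, x>]_+ : zero once the example clears the margin."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def hinge_subgradient(w, x, y):
    """A subgradient of the hinge loss in w (zero when the margin is satisfied)."""
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(x)

# One subgradient step on a single example (step size 1.0)
w = np.zeros(2)
x, y = np.array([2.0, 0.0]), 1
w = w - hinge_subgradient(w, x, y)
```

After the step the example sits outside the margin, so its loss and subgradient both vanish.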
Structured SVM
Correct outputs score higher by a margin:
  ⟨w, ψ(xi, yi)⟩ − ⟨w, ψ(xi, y)⟩ ≥ 1  for all y ≠ yi,  with ‖w‖ → min
Taking into account the (dis)similarity between outputs, the margin depends on the dissimilarity:
  ⟨w, ψ(xi, yi) − ψ(xi, y)⟩ ≥ Δ(yi, y)  for all y ≠ yi
43