
DETERMINANTAL POINT PROCESSES FOR NATURAL LANGUAGE PROCESSING

Jennifer Gillenwater
Joint work with Alex Kulesza and Ben Taskar

OUTLINE

Motivation & background on DPPs
Large-scale settings
Structured summarization
Other potential NLP applications

MOTIVATION & BACKGROUND

SUMMARIZATION

Quality: relevance to the topic
Diversity: coverage of core ideas

SUBSET SELECTION

AREA AS SET-GOODNESS

feature space: items i and j represented by vectors B_i, B_j

quality = \sqrt{B_i^\top B_i}
similarity = B_i^\top B_j

area = \sqrt{\|B_i\|_2^2 \|B_j\|_2^2 - (B_i^\top B_j)^2}

VOLUME AS SET-GOODNESS

area = \sqrt{\|B_i\|_2^2 \|B_j\|_2^2 - (B_i^\top B_j)^2}

length = \|B_i\|_2
volume = base \times height
vol(B) = \|B_1\|_2 \, vol(proj_{\perp B_1}(B_{2:N}))

AREA AS A DET

area = \sqrt{\|B_i\|_2^2 \|B_j\|_2^2 - (B_i^\top B_j)^2}
     = \det \begin{pmatrix} \|B_i\|_2^2 & B_i^\top B_j \\ B_i^\top B_j & \|B_j\|_2^2 \end{pmatrix}^{1/2}
     = \det\big( [B_i \; B_j]^\top [B_i \; B_j] \big)^{1/2}

VOLUME AS A DET

vol(B_{\{i,j\}}) = \det\big( [B_i \; B_j]^\top [B_i \; B_j] \big)^{1/2}

vol(B) = \det\big( [B_1 \cdots B_N]^\top [B_1 \cdots B_N] \big)^{1/2}

vol(B)^2 = \det(B^\top B) = \det(L)
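The last identity is easy to check numerically. A minimal sketch (Python with NumPy; the random B is purely illustrative), verifying both vol(B)^2 = det(B^T B) and the base-times-height recursion from the previous slide:

    import numpy as np

    rng = np.random.default_rng(0)
    D, N = 5, 3                      # feature dimension, number of items
    B = rng.normal(size=(D, N))      # columns B_1, ..., B_N (illustrative)

    L = B.T @ B                      # the DPP kernel L, with L_ij = B_i^T B_j
    vol_squared = np.linalg.det(L)

    # Cross-check via base x height: ||B_1|| times the volume of the
    # remaining columns projected orthogonally to B_1.
    b1 = B[:, 0]
    P = np.eye(D) - np.outer(b1, b1) / (b1 @ b1)   # projector onto b1's complement
    rest = P @ B[:, 1:]
    vol_rec = np.linalg.norm(b1) * np.sqrt(np.linalg.det(rest.T @ rest))

    assert np.isclose(vol_squared, vol_rec ** 2)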

COMPLEX STATISTICS

N items ⟹ 2^N sets

EFFICIENT COMPUTATION

\sum_Y \det(L_Y) over all 2^N sets: computable in O(N^3)
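Concretely, the exponential sum collapses to a single determinant, \sum_Y \det(L_Y) = \det(L + I) (the normalizer that appears two slides below), which costs O(N^3). A minimal brute-force check for small N, assuming NumPy:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    N = 8
    B = rng.normal(size=(5, N))
    L = B.T @ B

    # Sum det(L_Y) over all 2^N subsets Y (det of the 0x0 matrix is 1).
    brute = sum(
        np.linalg.det(L[np.ix_(Y, Y)])
        for k in range(N + 1)
        for Y in combinations(range(N), k)
    )
    assert np.isclose(brute, np.linalg.det(L + np.eye(N)))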

POINT PROCESSES

Y = {1, . . . , N}

A point process assigns a probability to every subset of Y, e.g. P(·) = 0.2 for one particular subset.

DETERMINANTAL

A determinantal point process scores a set by the determinant of the corresponding submatrix of an N × N kernel L:

P({2, 3, 5}) \propto \det \begin{pmatrix} L_{22} & L_{23} & L_{25} \\ L_{32} & L_{33} & L_{35} \\ L_{52} & L_{53} & L_{55} \end{pmatrix}

P({2, 3, 5}) = \det(L_{\{2,3,5\}}) / \det(L + I)
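A minimal sketch (assuming NumPy) of this subset probability; the kernel here is random and purely illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(4, 5))
    L = B.T @ B                              # a valid (PSD) 5x5 DPP kernel

    Y = [1, 2, 4]                            # items {2, 3, 5}, zero-indexed
    L_Y = L[np.ix_(Y, Y)]
    p = np.linalg.det(L_Y) / np.linalg.det(L + np.eye(5))
    print(p)                                 # P({2, 3, 5})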

EFFICIENT INFERENCE

Normalizing:    P_L(\mathbf{Y} = Y)
Marginalizing:  P(Y \subseteq \mathbf{Y})
Conditioning:   P_L(\mathbf{Y} = B \mid A \subseteq \mathbf{Y}),  P_L(\mathbf{Y} = B \mid A \cap \mathbf{Y} = \emptyset)
Sampling:       Y \sim P_L

All computable in O(N^3).
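The slides do not spell out how marginalization works; one standard route (an assumption here, though well established for DPPs) is the marginal kernel K = L(L + I)^{-1}, for which P(A \subseteq \mathbf{Y}) = \det(K_A). A minimal sketch assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(4, 6))
    L = B.T @ B
    N = L.shape[0]

    # Marginal kernel K = L (L + I)^{-1}, one O(N^3) computation.
    K = L @ np.linalg.inv(L + np.eye(N))
    A = [0, 3]
    print(np.linalg.det(K[np.ix_(A, A)]))    # P({1, 4} is included in Y)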

LARGE-SCALE SETTINGS

DUAL KERNEL
KULESZA AND TASKAR (NIPS 2010)

B = [B_1 B_2 B_3 ... B_N]   (D × N feature matrix)

L = B^\top B   (N × N)

C = B B^\top   (D × D)

DUAL INFERENCE

L = V \Lambda V^\top,   C = \hat{V} \Lambda \hat{V}^\top,   V = B^\top \hat{V} \Lambda^{-1/2}

Normalizing: \sum_Y \det(L_Y) in O(D^3)
Marginalizing & Conditioning: O(D^3 + D^2 k^2)
Sampling: Y \sim P_L in O(N D^2 k)
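A minimal sketch (assuming NumPy) of the dual trick: eigendecompose the small D × D kernel C instead of the N × N kernel L; the nonzero eigenvalues agree, and L's eigenvectors are recovered as V = B^\top \hat{V} \Lambda^{-1/2}. The dimensions are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    D, N = 3, 1000
    B = rng.normal(size=(D, N))

    C = B @ B.T                                  # D x D dual kernel
    lam, V_hat = np.linalg.eigh(C)               # O(D^3) instead of O(N^3)
    V = B.T @ V_hat @ np.diag(lam ** -0.5)       # eigenvectors of L = B^T B

    L = B.T @ B                                  # built here only to verify
    assert np.allclose(L @ V, V * lam)           # (lam, V) are eigenpairs of L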

EXPONENTIAL N

N = O({sentence length}^{sentence length})

We want to select a diverse set of parses.

N = O({node degree}^{path length})

STRUCTURE FACTORIZATIONKULESZA AND TASKAR (NIPS 2010)

i =

STRUCTURE FACTORIZATION

Bi = q(i)�(i)

KULESZA AND TASKAR (NIPS 2010)

quality similarity

i =

STRUCTURE FACTORIZATION

Bi = q(i)�(i)

i = {i↵}↵2F↵

c = 1

KULESZA AND TASKAR (NIPS 2010)

quality similarity

i =

STRUCTURE FACTORIZATION

Bi = q(i)�(i)

i = {i↵}↵2F↵

c = 2

KULESZA AND TASKAR (NIPS 2010)

quality similarity

i =

STRUCTURE FACTORIZATION

i = {i↵}↵2F

Bi =

Q↵2F

q(i↵)

��(i)

c = 2

KULESZA AND TASKAR (NIPS 2010)

i =

STRUCTURE FACTORIZATION

i = {i↵}↵2F

Bi =

Q↵2F

q(i↵)

� P↵2F

�(i↵)

c = 2

KULESZA AND TASKAR (NIPS 2010)

i =

STRUCTURE FACTORIZATION

i = {i↵}↵2F

Bi =

Q↵2F

q(i↵)

� P↵2F

�(i↵)

c = 2

O(ND2k)Y ⇠ PL

KULESZA AND TASKAR (NIPS 2010)

STRUCTURE FACTORIZATION

Bi =

Q↵2F

q(i↵)

� P↵2F

�(i↵)

M = R =

c = 2

Y ⇠ PL O(D2k3 +Dk2M cR)

KULESZA AND TASKAR (NIPS 2010)

STRUCTURE FACTORIZATION

Bi =

Q↵2F

q(i↵)

� P↵2F

�(i↵)

M = R =

c = 2

Y ⇠ PL O(D2k3 +Dk2M cR)

KULESZA AND TASKAR (NIPS 2010)

M cR = 42 ⇤ 12 = 192 ⌧ N = 412 = 16,777,216
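A minimal sketch (assuming NumPy) of the factored construction of B_i: quality multiplies over factors, similarity features add over factors. The unit-normalization of \phi, which keeps quality and similarity separate, is an assumed convention here, and the factor values are toy numbers:

    import numpy as np

    def structure_vector(factor_qualities, factor_features):
        """B_i = (prod over factors of q(i_a)) * (sum over factors of phi(i_a))."""
        q = np.prod(factor_qualities)            # q(i) multiplies over factors
        phi = np.sum(factor_features, axis=0)    # phi(i) adds over factors
        phi = phi / np.linalg.norm(phi)          # unit-normalize (assumed convention)
        return q * phi

    # e.g. a structure with three factors, each carrying a local quality
    # score and a local feature vector (toy values):
    B_i = structure_vector([0.9, 0.7, 0.8], np.eye(3))
    print(B_i)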

LARGE FEATURE SETS?

                           N = # of items
                           Large    Exponential
D = # of      Small        dual     dual + structure
features      Large        ?        ?

RANDOM PROJECTIONS
GILLENWATER, KULESZA, AND TASKAR (EMNLP 2012)

Reduce the feature dimension from D to d ≪ D by multiplying the feature matrix (D × M^c R in the structured case) by a random d × D matrix G.

VOLUME PRESERVATION

JOHNSON AND LINDENSTRAUSS (1984): projecting to d = O(log N) random dimensions approximately preserves pairwise distances.

MAGEN AND ZOUZIAS (2008): a similar random projection approximately preserves the volumes spanned by small subsets of points.

DPP PRESERVATION
GILLENWATER, KULESZA, AND TASKAR (EMNLP 2012)

vol^2 = det, so preserving volumes for k = 1, 2, 3, ... preserves the determinants that define DPP probabilities.

d = O\Big( \max\Big\{ \frac{k}{\epsilon}, \; \frac{\log(1/\delta) + \log N}{\epsilon^2} + k \Big\} \Big)

(k = subset size, N = total # of items)

w.p. 1 - \delta:   \| P^k - \tilde{P}^k \|_1 \le e^{6 k \epsilon} - 1

[Plot: L1 variational distance between the original and projected k-DPP, and memory use in bytes, as a function of projection dimension]
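The figure's message can be reproduced in miniature: project the features with a random Gaussian G and check that subset determinants barely move. A minimal sketch assuming NumPy (the dimensions are illustrative, not the paper's):

    import numpy as np

    rng = np.random.default_rng(0)
    D, d, N = 1000, 100, 50
    B = rng.normal(size=(D, N)) / np.sqrt(D)  # columns with roughly unit norm

    G = rng.normal(size=(d, D)) / np.sqrt(d)  # random Gaussian projection
    B_proj = G @ B

    Y = [3, 7, 19]                            # a fixed subset of size k = 3
    for M in (B, B_proj):
        L_Y = (M.T @ M)[np.ix_(Y, Y)]
        print(np.linalg.det(L_Y))             # roughly preserved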

STRUCTURED SUMMARIZATION

NEWS THREADING

March 28: Health officials confirm Ebola outbreak in Guinea's capital
August 8: World Health Organization declares Ebola epidemic an international health emergency
September 2: GlaxoSmithKline begins Ebola vaccine drug trial

N ≈ 10^360 possible threads over M ≈ 35,000 articles

PROJECTING NEWS FEATURES

\phi(i), D = 36,356   →   G\phi(i), d = 50
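A minimal sketch (assuming NumPy) of this projection step at the slide's dimensions; the features are random stand-ins for the real news features, and the 1/\sqrt{d} scaling of G is an assumption, since the slides do not give it:

    import numpy as np

    rng = np.random.default_rng(0)
    D, d, N = 36_356, 50, 200
    phi = rng.random((D, N))                   # stand-in for news features phi(i)

    G = rng.normal(size=(d, D)) / np.sqrt(d)   # random projection (scaling assumed)
    B = G @ phi                                # d x N, used in place of phi
    print(B.shape)                             # (50, 200)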

DPP THREADS

Timeline: Jan 08, Jan 28, Feb 17, Mar 09, Mar 29, Apr 18, May 08, May 28, Jun 17

pope vatican church parkinson
israel palestinian iraqi israeli gaza abbas baghdad
owen nominees senate democrats judicial filibusters
social tax security democrats rove accounts
iraq iraqi killed baghdad arab marines deaths forces

Example thread (pope vatican church parkinson):
Feb 24: Parkinson's Disease Increases Risks to Pope
Feb 26: Pope's Health Raises Questions About His Ability to Lead
Mar 13: Pope Returns Home After 18 Days at Hospital
Apr 01: Pope's Condition Worsens as World Prepares for End of Papacy
Apr 02: Pope, Though Gravely Ill, Utters Thanks for Prayers
Apr 18: Europeans Fast Falling Away from Church
Apr 20: In Developing World, Choice [of Pope] Met with Skepticism
May 18: Pope Sends Message with Choice of Name

QUANTITATIVE RESULTS

System        k-means   DTM      DPP
ROUGE-1F      16.5      14.7     17.2
R-SU4F        3.76      3.44     3.98
Coherence     2.73      3.2      3.3
Runtime (s)   626       19,434   252

OTHER POTENTIAL NLP APPLICATIONS

POTENTIAL APP: RE-RANKING

• Parser: a simple model with local features defines basic scores for all possible parse trees

• Re-ranker: a more complex model with non-local features provides more refined scores

• Typical pipeline: find the k highest-scoring parses under the simple model, then score these k with the more complex model and output the best

• Issue: the k parses may be largely redundant, so the re-ranker never gets to consider significantly different parses

IDEA: USE DPPS FOR SELECTING RE-RANKER INPUT

N = O({sentence length}^{sentence length})

We want to select a diverse set of parses.

Quality: standard parser scores
Diversity: edge lengths, POS pairs, etc.
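A minimal sketch (assuming NumPy) of the kernel this suggests, combining quality and diversity via B_i = q(i) \phi(i) as in the structure-factorization slide. All values below are random stand-ins for real parser output:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 100, 20                           # candidate parses, diversity features
    scores = rng.normal(size=N)              # stand-in for log parser scores
    q = np.exp(scores)                       # quality: positive parser scores
    phi = rng.normal(size=(D, N))            # stand-in for edge-length/POS features
    phi /= np.linalg.norm(phi, axis=0)       # unit-normalize diversity features

    B = phi * q                              # B_i = q(i) * phi(i)
    L = B.T @ B                              # L_ij = q_i q_j phi_i^T phi_j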

POTENTIAL APP: WORD SENSE INDUCTION

• Goal: identify all possible senses of ambiguous words (e.g. river bank vs. bank deposit)

• Typical approach: unsupervised clustering, where cluster centers represent word senses

• Why DPPs fit: finding cluster centers can be re-expressed as the problem of finding a high-quality, diverse set

Quality: centrality (density of points nearby)
Diversity: same as standard WSI features

QUESTIONS?
