Learning and Generating Paraphrases From Twitter and Beyond. Wei Xu. Guest Lecture @ Penn MT class, April-2-2015. Computer and Information Science, University of Pennsylvania

Learning and Generating Paraphrases From Twitter and Beyond · mt-class.org/penn/slides/guest-lecture-wei-xu.pdf · 2016. 10. 21.

Learning and Generating Paraphrases

From Twitter and Beyond

Wei Xu

Guest Lecture @ Penn MT class April-2-2015

Computer and Information Science, University of Pennsylvania

Research Overview

TACL 15, NAACL 15, TACL 14, ACL 14, ACL 13, BUCC 13, LASM 13, COLING 12, IJCNLP 11, EMNLP 11, ACL 06

Social Media

Paraphrase

Information Extraction

Paraphrase

Paraphrase

word: wealthy / rich

phrase: the king’s speech / His Majesty’s address

sentence: … the forced resignation of the CEO of Boeing, Harry Stonecipher, for … / … after Boeing Co. Chief Executive Harry Stonecipher was ousted from …

Application: Information Extraction

… the forced resignation of the CEO of Boeing, Harry Stonecipher, for …

… after Boeing Co. Chief Executive Harry Stonecipher was ousted from …

extract: end_job (Harry Stonecipher, Boeing)

Wei Xu, Raphael Hoffmann, Le Zhao, Ralph Grishman. “Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction” In ACL (2013)

Application: Question Answering

Who is the CEO stepping down from Boeing?

match:

… the forced resignation of the CEO of Boeing, Harry Stonecipher, for …

… after Boeing Co. Chief Executive Harry Stonecipher was ousted from …

Application: Text Simplification

They are culturally akin to the coastal peoples of Papua New Guinea.

Their culture is like that of the coastal peoples of Papua New Guinea.

Wei Xu, Chris Callison-Burch. “Problems in Current Text Simplification Research: New Data Can Help” to appear in TACL (2015)

NSF EAGER: “Simplification as Machine Translation” (2014 ~ 2015)

Application: Stylistic Rewriting

Palpatine: If you will not be turned, you will be destroyed!

If you will not be turn’d, you will be undone!

Luke: Father, please! Help me!

Father, I pray you! Help me!

Wei Xu, Alan Ritter, Bill Dolan, Ralph Grishman, Colin Cherry. “Paraphrasing for Style” In COLING (2012)

Previous Work

Numerous publications on paraphrase identification, extraction, generation and various applications

But, primarily for formal language usage and well-edited text

Previous Work

only a few hundred news agencies report big events

using formal language

(Dolan, Quirk and Brockett, 2004; Dolan and Brockett, 2005; Brockett and Dolan, 2005)

Twitter as a new resource

Wei Xu, Alan Ritter, Ralph Grishman. “A Preliminary Study of Tweet Summarization using Information Extraction” in LASM (2014)

Twitter as a powerful resource

thousands of users talk about both big and micro events

using formal, informal, erroneous language

Very diverse!

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

Enables new applications

Information Retrieval

pittsburgh ? pgh, pittsburg, pixburgh, pit, steelers

against the steelers / against pittsburgh

Wei Xu, Alan Ritter, Ralph Grishman. “Gathering and Generating Paraphrases from Twitter with Application to Normalization” In BUCC (2013)

Enables new applications

Noisy Text Normalization

oscar nom’d doc → Oscar-nominated documentary

don’t want for → don’t wait for

Wei Xu, Joel Tetreault, Martin Chodorow, Ralph Grishman, Le Zhao. “Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models” In EMNLP (2011)

Enables new applications

Human-computer Interaction

who wants to get a beer?

want to get a beer?

who else wants to get a beer?

who wants to go get a beer?

trying to get a beer?

who wants to buy a beer?

who else wants to get a beer?

… (21 different ways)

Wei Xu, Alan Ritter, Ralph Grishman. “Gathering and Generating Paraphrases from Twitter with Application to Normalization” In BUCC (2013)

Enables new applications

Language Education

Aaaaaaaaand stephen curry is on fire

What a incredible performance from Stephen Curry

Enables new applications

Wowsers to this nets bulls game

This nets vs bulls game is great

Sentiment Analysis

This Nets vs Bulls game is nuts

This Nets and Bulls game is a good game

this Nets vs Bulls game is too live

This NetsBulls series is intense

This netsbulls game is too good

Learn Paraphrases

Learn Paraphrases

identify parallel sentences automatically from Twitter’s big data stream

Mancini has been sacked by Manchester City / Mancini gets the boot from Man City: Yes!

WORLD OF JENKS IS ON AT 11 / World of Jenks is my favorite show on tv: No!

Early Attempts

• 1242 tweet pairs, tracking celebrity & hashtags (Zanzotto, Pennacchiotti and Tsioutsiouliklis, 2011)

• named entity + date (Xu, Ritter and Grishman, 2013)

• bilingual posts (Ling, Dyer, Black and Trancoso, 2013)

Design a Model

Train it on data

A Challenge

Mancini has been sacked by Manchester City

Mancini gets the boot from Man City

very short, lexically divergent

(less word overlap, even in high-dimensional space)

Design a Model

At-least-one-anchor Assumption

two sentences about the same topic are paraphrases if and only if they contain at least one word pair that is a paraphrase anchor

That boy Brook Lopez with a deep 3 / brook lopez hit a 3: Yes!

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

Another Challenge

not every word pair of similar meaning indicates sentence-level paraphrase

Iron Man 3 was brilliant fun / Iron Man 3 tonight see what this is like: No!

Solution: a discriminative model using features at word-level

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

Multi-instance Learning Paraphrase Model

[Figure: a sentence pair with label Y (paraphrase / non-paraphrase) sits above its word pairs; each word pair carries a latent binary label Z1 … Z4 and a feature vector.]

sentence pair: Manti bout to be the next Junior Seau / Teo is the little new Junior Seau

word pairs and features:

manti | teo: diff_word, same_pos_nn, both_sig, …

be | is: same_stem, same_pos_be, not_both_sig, …

next | new: diff_word, same_pos_jj, both_sig, …

manti | little: diff_word, diff_pos_nn, diff_pos_jj, not_both_sig, …

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

[Mini Tutorial] Multi-instance Learning

Instead of labels on each individual instance, the learner only observes labels on bags of instances.

Negative Bags: a bag is labeled negative, if all the examples in it are negative

Positive Bags: a bag is labeled positive, if there is at least one positive example

(Dietterich et al., 1997)
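
The bag-labeling rule above can be sketched in a few lines of Python. This is only an illustration: the function name `bag_label` and the instance labels in the two example bags are assumptions made up for the sketch (the word pairs are borrowed from the slides, but their labels are not taken from them).

```python
from typing import Dict, Tuple

WordPair = Tuple[str, str]


def bag_label(instance_labels: Dict[WordPair, int]) -> int:
    """Return 1 if at least one instance in the bag is positive, else 0."""
    return int(any(label == 1 for label in instance_labels.values()))


# Hypothetical instance labels, for illustration only.
positive_bag = {("manti", "teo"): 1, ("be", "is"): 0, ("next", "new"): 0}
negative_bag = {("brilliant", "tonight"): 0, ("fun", "see"): 0}

print(bag_label(positive_bag))  # one positive instance makes the bag positive
print(bag_label(negative_bag))  # all instances negative, so the bag is negative
```

In the learning setting only the bag label is observed, so a learner has to infer which instance labels are consistent with it.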

[Mini Tutorial] Multi-instance Learning

Latent Variable Model

Positive Bag: a bag is labeled positive, if there is at least one positive example

[Figure: bag label Y = 1 (observed) connected by constraints to instance labels Z1, Z2, Z3 (latent; some shown as 1, others as ?), each with its own features.]

[Mini Tutorial] Multi-instance Learning

Latent Variable Model

Negative Bag: a bag is labeled negative, if all the examples in it are negative

[Figure: bag label Y = 0 (observed) connected by constraints to instance labels Z1, Z2, Z3 (latent; all 0), each with its own features.]

[Mini Tutorial] Multi-instance Learning

Distantly Supervised Information Extraction

1. incomplete knowledge base problem

Wei Xu, Raphael Hoffmann, Le Zhao, Ralph Grishman. “Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction” In ACL (2013)

2. distant supervision + human-labeled data

Maria Pershina, Bonan Min, Wei Xu, Ralph Grishman. “Infusion of Labeled Data into Distant Supervision for Relation Extraction” In ACL (2014)

3. IE + IR

Wei Xu, Ralph Grishman, Le Zhao. “Passage Retrieval for Information Extraction using Distant Supervision” In IJCNLP (2011)

[Figure: plate diagram of a latent variable model with variables x_i, z_i, h_i, y_i, split into a relation level and a mention level.]

[Recap] Multi-instance Learning Paraphrase Model

[Figure: sentence pair “Manti bout to be the next Junior Seau” / “Teo is the little new Junior Seau” with label Y (paraphrase / non-paraphrase); word pairs manti | teo, be | is, next | new, manti | little, …, each with a latent label Z and features such as diff_word, same_stem, same_pos_nn, same_pos_be, same_pos_jj, diff_pos_nn, diff_pos_jj, both_sig, not_both_sig.]

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

Joint Word-Sentence Model

Model the assumption: sentence-level paraphrase is anchored by at-least-one word pair

[Figure: plate diagram. Sentence pair (plate S×S) with bag label Y (observed); word pairs (plate W×W) with instance labels Zi, Zj (latent); Y is tied to the Z variables by a deterministic OR.]

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

Joint Word-Sentence Model

[Equation: the model’s joint probability, annotated with the following labels.]

parameters; features; deterministic OR

j-th word pair

i-th sentence pair’s label (observed or to be predicted)

latent labels for all word pairs in the i-th sentence pair

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)
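
The equation itself did not survive extraction. One form consistent with the labels on this slide (parameters, features, a deterministic OR, per-word-pair latent labels) is sketched below. The symbols are assumptions chosen for this sketch, not recovered from the slide: $\theta$ for the parameters, $f$ for the feature function, $w_j$ for the j-th word pair, $z_j$ for its latent label, $\mathbf{z}_i$ for all latent labels in the i-th sentence pair, $y_i$ for that pair's label, and $\sigma$ for the deterministic OR.

```latex
P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta)
  \;\propto\;
  \sigma(\mathbf{z}_i, y_i)\,
  \prod_{j=1}^{m} \exp\bigl(\theta \cdot f(z_j, w_j)\bigr),
\qquad
\sigma(\mathbf{z}_i, y_i) =
\begin{cases}
1 & \text{if } y_i = 1 \text{ and } \exists j : z_j = 1, \\
1 & \text{if } y_i = 0 \text{ and } \forall j : z_j = 0, \\
0 & \text{otherwise.}
\end{cases}
```

Under this reading, the factor $\sigma$ rules out any assignment in which the sentence-level label disagrees with the OR of the word-pair labels.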

Learning Algorithm

Objective: learn the parameters that maximize likelihood over the training corpus

[Equation: the likelihood objective, annotated with the labels “i-th training sentence pair” and “all possible values of the latent variables”.]

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)
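
The objective on this slide is lost in extraction. A form that matches its two surviving labels (a product over training sentence pairs and a sum over all values of the latent variables) would be the following, where $\theta$, $\mathbf{w}_i$, $\mathbf{z}_i$ and $y_i$ are notation assumed for this sketch: the parameters, the word pairs of the i-th training sentence pair, their latent labels, and the observed sentence-level label.

```latex
\theta^{*}
  = \arg\max_{\theta} \prod_{i} P(y_i \mid \mathbf{w}_i; \theta)
  = \arg\max_{\theta} \prod_{i} \sum_{\mathbf{z}_i} P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta)
```

The inner sum marginalizes out the unobserved word-pair labels, which is what makes exact optimization expensive and motivates an approximate update.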

Learning Algorithm

Perceptron-style Update:

reward correct (conditioned on labels)

penalize wrong (ignoring labels)

Viterbi approximation + online learning, O(# word pairs)

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)
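
A minimal Python sketch of a perceptron-style update of this kind follows. It is a simplified illustration under stated assumptions, not the authors' implementation: features are binary and stored as sets of names, the score of labeling a word pair 0 is fixed at zero, and all function names and the toy data are hypothetical. Each update touches every word pair once, in line with the O(# word pairs) cost noted on the slide.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Features = Set[str]  # active binary features of one word pair (assumed encoding)
Bag = List[Features]  # all word pairs of one sentence pair


def score(theta: Dict[str, float], feats: Features) -> float:
    """Score of labeling a word pair as an anchor (z = 1); z = 0 scores 0."""
    return sum(theta[f] for f in feats)


def predict(theta: Dict[str, float], bag: Bag) -> List[int]:
    """Best latent labels ignoring the sentence label."""
    return [1 if score(theta, feats) > 0 else 0 for feats in bag]


def predict_given_label(theta: Dict[str, float], bag: Bag, y: int) -> List[int]:
    """Best latent labels that satisfy the deterministic OR for label y."""
    if y == 0:
        return [0] * len(bag)
    z = predict(theta, bag)
    if bag and not any(z):
        # Force the highest-scoring word pair to be the anchor.
        best = max(range(len(bag)), key=lambda j: score(theta, bag[j]))
        z[best] = 1
    return z


def perceptron_update(theta: Dict[str, float], bag: Bag, y: int) -> None:
    z_pred = predict(theta, bag)
    if int(any(z_pred)) == y:
        return  # sentence-level prediction already correct
    z_gold = predict_given_label(theta, bag, y)
    for feats, zg, zp in zip(bag, z_gold, z_pred):
        for f in feats:
            theta[f] += zg - zp  # reward constrained labels, penalize free ones


def train(data: List[Tuple[Bag, int]], epochs: int = 5) -> Dict[str, float]:
    theta: Dict[str, float] = defaultdict(float)
    for _ in range(epochs):
        for bag, y in data:
            perceptron_update(theta, bag, y)
    return theta


# Toy data: feature names echo the slides, labels are made up.
data = [
    ([{"same_stem"}, {"diff_word", "same_pos_nn"}], 1),
    ([{"diff_word", "diff_pos"}], 0),
]
theta = train(data)
```

The update only fires when the sentence-level prediction is wrong, and it moves weight toward the label-consistent assignment and away from the unconstrained one.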

Training Data

Annotation: Crowdsourcing

(Courtesy: The Sheep Market by Aaron Koblin)

Annotation: Crowdsourcing

Wei Xu. “Data-driven Approaches for Paraphrasing Across Language Variations” PhD Thesis, New York University. (2014)

A Problem

only 8% of sentence pairs about the same topic have similar meaning

non-experts lower their bars

hurts both quantity and quality

Wei Xu. “Data-driven Approaches for Paraphrasing Across Language Variations” PhD Thesis, New York University. (2014)

Sentence Selection

SumBasic Algorithm

8% → 16%

[Figure: bar chart of percentages of paraphrases (axis 0 to 0.8) per topic (Netflix, Jeff Green, Ryu, The Clippers, Reggie Miller), comparing Random with w/ Selection.]

Wei Xu. “Data-driven Approaches for Paraphrasing Across Language Variations” PhD Thesis, New York University. (2014)
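
The slide names SumBasic without describing how it was applied. As background, here is a simplified Python sketch of the general SumBasic idea: score each sentence by the average probability of its words, pick the best one, then square the probabilities of the words just covered so later picks favor new content. Whitespace tokenization, the function name `sumbasic`, and the omission of the usual "must contain the most probable word" step are simplifying assumptions of this sketch.

```python
from collections import Counter
from typing import List


def sumbasic(sentences: List[str], k: int) -> List[str]:
    """Select up to k sentences with a simplified SumBasic procedure."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    chosen: List[int] = []
    while len(chosen) < min(k, len(sentences)):
        candidates = [
            i for i in range(len(sentences)) if i not in chosen and tokenized[i]
        ]
        if not candidates:
            break
        # Score = average word probability of the sentence.
        best = max(
            candidates,
            key=lambda i: sum(prob[w] for w in tokenized[i]) / len(tokenized[i]),
        )
        chosen.append(best)
        # Down-weight words that are now covered.
        for w in set(tokenized[best]):
            prob[w] = prob[w] ** 2
    return [sentences[i] for i in chosen]


tweets = [
    "This Nets vs Bulls game is nuts",
    "This nets vs bulls game is great",
    "Wowsers to this nets bulls game",
    "who wants to get a beer?",
]
picked = sumbasic(tweets, 2)
```

On these four tweets (taken from earlier slides) the off-topic beer tweet is not among the two selected, since its words are rare in the collection.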

Topic Selection

Multi-Armed Bandits

16% → 34%

Wei Xu. “Data-driven Approaches for Paraphrasing Across Language Variations” PhD Thesis, New York University. (2014)
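
The slide does not say which bandit strategy was used. As one possible illustration, the Python sketch below uses UCB1, treating each topic as an arm and the observed paraphrase rate of a sampled batch as the reward. The choice of UCB1, the function names, and the reward rates are all assumptions for the sketch.

```python
import math
from typing import Callable, List


def ucb1_select(counts: List[int], rewards: List[float], t: int) -> int:
    """Pick an arm: try every arm once, then maximize mean + exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: rewards[a] / counts[a]
        + math.sqrt(2 * math.log(t) / counts[a]),
    )


def run_bandit(pull: Callable[[int], float], n_arms: int, rounds: int) -> List[int]:
    """Run UCB1 for a number of rounds and return how often each arm was pulled."""
    counts = [0] * n_arms
    rewards = [0.0] * n_arms
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, rewards, t)
        rewards[arm] += pull(arm)
        counts[arm] += 1
    return counts


# Hypothetical per-topic paraphrase rates, used as deterministic rewards.
rates = [0.1, 0.8, 0.3]
counts = run_bandit(lambda arm: rates[arm], n_arms=3, rounds=200)
```

With these made-up rates, most of the annotation budget ends up on the topic with the highest paraphrase rate, which is the effect topic selection is after.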

Twitter Paraphrase Dataset

18,762 sentence pairs labeled

cost only $200 !!!

1/3 paraphrase, 2/3 non-paraphrase (very balanced)

including a very broad range of paraphrases: synonyms, misspellings, slang, acronyms and colloquialisms (important but difficult to obtain)

Wei Xu. “Data-driven Approaches for Paraphrasing Across Language Variations” PhD Thesis, New York University. (2014)

Performance

Performance

state-of-the-art of paraphrase identification

[Figure: bar chart of Precision and Recall (axis 40 to 100) for (Das & Smith, 2009), (Guo & Diab, 2012), (Ji & Eisenstein, 2013), Our Model, and Human Upperbound. Bar values as extracted, bar assignment not preserved: 90.8, 72.6, 62.8, 65.5, 63.2, 75.2, 72.2, 66.4, 52.5, 62.9.]

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

Performance

[Figure: the same Precision and Recall bar chart, with Our Model and (Ji & Eisenstein, 2013) highlighted.]

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

Impact

SemEval 2015 shared task on “Paraphrase in Twitter”

19 + 1 teams participated

100+ research groups have requested the data since Nov 2014

our model: paraphrase identification (0 or 1), rank 1; semantic similarity (0 ~ 1), rank 4

Wei Xu, Chris Callison-Burch, Bill Dolan. “SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT)” In SemEval (2015)

Innovations

That boy Brook Lopez with a deep 3 / brook lopez hit a 3: Yes!

Multi-instance Learning Paraphrase Model (MultiP)

- Twitter’s big data stream
- potential beyond Twitter and English
- joint sentence-word alignment
- extensible latent variable model

(a lot of space for future work)

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

Generate Paraphrases

Extract Phrasal Paraphrases

align:

Mancini has been sacked by Manchester City

Mancini gets the boot from Man City

Wei Xu, Alan Ritter, Ralph Grishman. “Gathering and Generating Paraphrases from Twitter with Application to Normalization” In BUCC (2013)

Extract Phrasal Paraphrases

has been sacked by / gets the boot from

manchester city / man city

4 / for

4 / four

outta / out of

hostes / hostess

Wei Xu, Alan Ritter, Ralph Grishman. “Gathering and Generating Paraphrases from Twitter with Application to Normalization” In BUCC (2013)

Text-to-text Generation

translate:

Hostes is going outta biz .

Hostess is going out of business .

Wei Xu, Alan Ritter, Ralph Grishman. “Gathering and Generating Paraphrases from Twitter with Application to Normalization” In BUCC (2013)
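
To make the "translate" step concrete, here is a toy Python sketch that rewrites a tokenized tweet with a small phrase table using greedy longest-match substitution. It is a deliberate simplification and an assumption of this sketch: a statistical MT decoder would score competing options (for example "4" as "for" versus "four") with translation and language models, which this lookup does not do. The table entries come from the examples in these slides; the function name `normalize` is hypothetical, and the sketch does not restore capitalization.

```python
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]

# Phrase pairs taken from the slide examples (lowercased).
PHRASE_TABLE: Dict[Phrase, Phrase] = {
    ("hostes",): ("hostess",),
    ("outta",): ("out", "of"),
    ("biz",): ("business",),
}


def normalize(
    tokens: List[str],
    table: Dict[Phrase, Phrase] = PHRASE_TABLE,
    max_len: int = 3,
) -> List[str]:
    """Rewrite tokens left to right, preferring the longest matching phrase."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i : i + n])
            if key in table:
                out.extend(table[key])
                i += n
                break
        else:
            # No phrase starts here: copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


result = normalize("Hostes is going outta biz .".split())
print(" ".join(result))  # hostess is going out of business .
```

Entries with more than one candidate rewrite, such as "4", are left out of the table on purpose, since choosing between them needs context.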

Statistical Machine Translation

Bilingual vs. Monolingual (Paraphrase)

studied: a lot more (bilingual) / recently (monolingual)

naturally available parallel text: more (bilingual) / less (monolingual)

sensitive to error: less (bilingual) / more (monolingual)

objective: straightforward (bilingual) / sophisticated (monolingual)

has standard evaluation: yes (bilingual) / not quite yet (monolingual)

Wei Xu. “Data-driven Approaches for Paraphrasing Across Language Variations” PhD Thesis, New York University. (2014)

Text-to-text Generation

complex → simple (Xu et al. 2015)

stylistic → plain (Xu et al. 2012)

noisy → standard (Xu et al. 2013)

erroneous → correct (Xu et al. 2011)

and more (future work) …

Wei Xu. “Data-driven Approaches for Paraphrasing Across Language Variations” PhD Thesis, New York University. (2014)

Prose to Sonnet

Wandering through rows of stalls examining workhorses and prize hogs may seem to … have been a strange way for a scientist to spend an afternoon, but there was a certain logic to it.

hogs may seem a bit strange through rows of stalls

[Rhyme] balls, falls, installs, walls

Quanze Chen, Chenyang Lei, Wei Xu, Ellie Pavlick and Chris Callison-Burch. “Poetry of the Crowd: A Human Computation Algorithm to Convert Prose into Rhyming Verse” In AAAI's HCOMP (2012)

Text Simplification

state-of-the-art (since 2010)

NSF EAGER: “Simplification as Machine Translation” (2014 ~ 2015)

Text Simplification

state-of-the-art (since 2010) is suboptimal!

is not all that simple

Wei Xu, Chris Callison-Burch. “Problems in Current Text Simplification Research: New Data Can Help” to appear in TACL (2015)

NSF EAGER: “Simplification as Machine Translation” (2014 ~ 2015)

Main Contributions

• Jointly model word-sentence via latent variables

• Use Twitter as a powerful paraphrase resource

• Systemize a framework for language generation

• Right the direction of text simplification research

all code and data are available on my homepage: http://www.cis.upenn.edu/~xwe/

The Ideal

Collaborators

Chris Callison-Burch (UPenn)

Ralph Grishman (NYU)

Bill Dolan (MSR)

Alan Ritter (UW / OSU)

Raphael Hoffmann (UW / AI2 Incubator)

Joel Tetreault (ETS / Yahoo!)

Le Zhao (CMU / Google)

Maria Pershina (NYU)

Martin Chodorow (CUNY)

Colin Cherry (NRC)

Yangfeng Ji (GaTech)

Ellie Pavlick (UPenn)

Mingkun Gao (UPenn)

Quanze Chen (UPenn)

Thank you

thanks

thanking you

appreciate it

thnx

thx

tyvm

thank you very much

thanks a lot

3x

say thanks

am grateful

wawwww thankkkkkkkkkkk you alotttttttttttt!

thank u 4 ur time

gratitude