Automatic Search Event by Automatic Keyword Extraction Xiwei Yan 08-10-2016

Automatic Search Event-Summary


Page 1: Automatic Search Event-Summary

Automatic Search Event by Automatic Keyword Extraction

Xiwei Yan, 08-10-2016

Page 2: Automatic Search Event-Summary

Overview

Ads landing pages -> HTML source code -> Text -> Keywords & Key Phrases -> Similar Webpages -> Audience

Page 3: Automatic Search Event-Summary

Motivation

• Automate the search events (free BA from manually generating the keywords)

• Identify users for campaigns that don’t have pixels

Page 4: Automatic Search Event-Summary

A First Glimpse at Result

Page 5: Automatic Search Event-Summary

Approach

• Preprocessing

• Keyword Extraction models

– TF-IDF

– TextRank

– Word2Vec + TextRank

– TextRank + Word2Vec

Page 6: Automatic Search Event-Summary

Approach 1 - TFIDF

• Preprocessing

– Lower case, lemmatize, remove stop words and punctuation, tokenize, tag and filter by part-of-speech tags

• Keyword Extraction models

– TF-IDF: TF-IDF(w, d, n, N) = TF(w, d) * IDF(n, N)

– TF(w, d) = # times word w occurs in doc d

– IDF(n, N) = log(N / n), where n = # docs in which word w appears and N = total # docs (the table values correspond to N = 3 and the natural log)

Word        Term freq in doc1   Appears in # docs   TF-IDF
car         27                  3                   0
auto        3                   2                   1.216
Insurance   0                   2                   0
Best        14                  2                   5.676
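The table values can be checked with a few lines of Python. This is a minimal sketch, assuming raw term counts for TF and a natural-log IDF of ln(N / n) with N = 3 documents; N is inferred from the table (car appears in 3 docs and scores 0) and is not stated on the slide. The function name `tf_idf` is illustrative.

```python
import math

def tf_idf(term_freq: int, docs_with_word: int, total_docs: int) -> float:
    """Raw term count times natural-log inverse document frequency."""
    if term_freq == 0 or docs_with_word == 0:
        return 0.0
    return term_freq * math.log(total_docs / docs_with_word)

# (term frequency in doc1, number of docs containing the word), taken from the table
table = {"car": (27, 3), "auto": (3, 2), "insurance": (0, 2), "best": (14, 2)}
TOTAL_DOCS = 3  # assumption: inferred from the table, not stated on the slide

scores = {word: tf_idf(tf, n, TOTAL_DOCS) for word, (tf, n) in table.items()}
for word, score in scores.items():
    print(f"{word:10s} {score:.4f}")
```

A word that appears in every document (car) gets an IDF of log(1) = 0, which is why the most frequent word in doc1 ends up with the lowest score.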

Page 7: Automatic Search Event-Summary

Approach 2 - TextRank

• Preprocessing: lower case, lemmatize, remove stop words and punctuation, tokenize, tag and filter by part-of-speech tags

• Identify structurally important keywords

• Iteratively calculate:

$$S(V_i) = (1 - d) + d \sum_{j \in ngbr(V_i)} \frac{1}{|degree(V_j)|} S(V_j)$$

d is the damping factor, usually set to 0.85

Page 8: Automatic Search Event-Summary

Approach 2 - TextRank

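[Figure: word graph with the nodes geico, auto, insurance, policy, privacy, find, car, coverage, call, service. Every node starts with score 1. Node degrees: 5, 5, 4, 2, 1, 1, 1, 1, 1, 1. Node scores after the first iteration: 2.65, 2.65, 2.19, 0.49, 0.36, 0.36, 0.32, 0.32, 0.32, 0.32.]

First iteration

$$S(V_i) = (1 - d) + d \sum_{j \in ngbr(V_i)} \frac{1}{|degree(V_j)|} S(V_j)$$

$$Score(geico) = 0.15 + 0.85 \left( \frac{1}{1} \cdot 1 + \frac{1}{1} \cdot 1 + \frac{1}{2} \cdot 1 + \frac{1}{5} \cdot 1 + \frac{1}{4} \cdot 1 \right) = 2.65$$

(neighbors of geico: service, call, auto, insurance, policy)

$$Score(policy) = 0.15 + 0.85 \left( \frac{1}{1} \cdot 1 + \frac{1}{1} \cdot 1 + \frac{1}{5} \cdot 1 + \frac{1}{5} \cdot 1 \right) = 2.19$$

(neighbors of policy: find, privacy, insurance, geico)

$$Score(service) = 0.15 + 0.85 \left( \frac{1}{5} \cdot 1 \right) = 0.32$$

(neighbor of service: geico)

d is the damping factor, usually set to 0.85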

Page 9: Automatic Search Event-Summary

Approach 2 - TextRank

[Figure: the same word graph (geico, auto, insurance, policy, privacy, find, car, coverage, call, service) after 10 iterations. Node degrees: 5, 5, 4, 2, 1, 1, 1, 1, 1, 1. Node scores: 2.12, 2.12, 1.77, 0.87, 0.52, 0.52, 0.51, 0.51, 0.51, 0.51.]

After 10 iterations: converged

$$S(V_i) = (1 - d) + d \sum_{j \in ngbr(V_i)} \frac{1}{|degree(V_j)|} S(V_j)$$

$$Score(geico) = 0.15 + 0.85 \left( \frac{1}{1} \cdot 0.51 + \frac{1}{1} \cdot 0.51 + \frac{1}{2} \cdot 0.87 + \frac{1}{5} \cdot 2.12 + \frac{1}{4} \cdot 1.77 \right) = 2.12$$

(neighbors of geico: service, call, auto, insurance, policy)

$$Score(policy) = 0.15 + 0.85 \left( \frac{1}{1} \cdot 0.52 + \frac{1}{1} \cdot 0.52 + \frac{1}{5} \cdot 2.12 + \frac{1}{5} \cdot 2.12 \right) = 1.75$$

(neighbors of policy: find, privacy, insurance, geico)

$$Score(service) = 0.15 + 0.85 \left( \frac{1}{5} \cdot 2.12 \right) = 0.51$$

(neighbor of service: geico)

Converges quickly (<= 20 iterations)

d is the damping factor, usually set to 0.85
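The worked TextRank example can be reproduced with a short script. This is a sketch under stated assumptions, not the author's implementation. The eight edges around geico and policy come from the worked equations. The three remaining edges (insurance-auto, insurance-car, insurance-coverage) are inferred from the degree list and the first-iteration scores. Updates are synchronous: each iteration reads only the previous iteration's scores.

```python
DAMPING = 0.85

EDGES = [
    # edges that appear in the worked equations
    ("geico", "service"), ("geico", "call"), ("geico", "auto"),
    ("geico", "insurance"), ("geico", "policy"),
    ("policy", "find"), ("policy", "privacy"), ("policy", "insurance"),
    # assumption: inferred from the degree list and first-iteration scores
    ("insurance", "auto"), ("insurance", "car"), ("insurance", "coverage"),
]

def build_neighbors(edges):
    """Turn an undirected edge list into a node -> set-of-neighbors map."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors

def textrank_step(scores, neighbors, d=DAMPING):
    """One synchronous update of S(V_i) = (1 - d) + d * sum(S(V_j) / degree(V_j))."""
    return {
        node: (1 - d) + d * sum(scores[n] / len(neighbors[n]) for n in nbrs)
        for node, nbrs in neighbors.items()
    }

def textrank(neighbors, iterations=20, d=DAMPING):
    """Start every node at score 1 and apply the update a fixed number of times."""
    scores = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        scores = textrank_step(scores, neighbors, d)
    return scores

neighbors = build_neighbors(EDGES)
first = textrank_step({node: 1.0 for node in neighbors}, neighbors)
final = textrank(neighbors, iterations=100)
for word in sorted(final, key=final.get, reverse=True):
    print(f"{word:10s} first={first[word]:.2f} converged={final[word]:.2f}")
```

With this graph the first-iteration values match the worked equations, and the converged scores land close to the ones shown for 10 iterations; small differences in the last digit come from rounding and the number of iterations.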

Page 10: Automatic Search Event-Summary

Approach 3 – Word2vec + ?

• Preprocessing

– No preprocessing (ideally)

• Keyword Extraction models

– Word2Vec + Clustering

Projection matrix

Page 11: Automatic Search Event-Summary

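Continuous Bag-of-Words Model + Negative Sampling

[Figure: the context words "The", "cat", "on", "that" enter as one-hot vectors W(1), W(2), ..., W(t-1) (each V*1). Projection Matrix W (D*V) maps them to a D*1 vector. Projection Matrix W' followed by a softmax gives a probability for each candidate word at position W(t): sits, cover, sample, input, predict, learn, believe, type, five, design, human, with the values 0.366, 0.2, 0.103, 0.100, 0.009, 0.011, 0.045, 0.050, 0.070, 0.010, 0.009. Training steps labelled in the figure: Cost Function, Backpropagation, Gradient Descent.]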
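The forward pass in the figure can be sketched in plain Python. This is an illustration only: it uses a full softmax rather than negative sampling, the weights are random rather than trained, and the toy vocabulary, the dimension D = 4 and the names `w_in` / `w_out` (standing in for W and W') are assumptions.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, w_in, w_out):
    """Average the context rows of w_in (the D*1 vector), then score every word with w_out."""
    dim = len(w_in[0])
    hidden = [sum(w_in[i][k] for i in context_ids) / len(context_ids) for k in range(dim)]
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in w_out]
    return softmax(logits)

random.seed(0)
vocab = ["the", "cat", "sits", "on", "that", "couch"]  # toy vocabulary (assumption)
V, D = len(vocab), 4                                   # D = 4 is an arbitrary toy size
# Each matrix is stored as V rows of length D; w_in plays the role of W, w_out of W'.
w_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

context = [vocab.index(w) for w in ("the", "cat", "on", "that")]
probs = cbow_forward(context, w_in, w_out)
print({word: round(p, 3) for word, p in zip(vocab, probs)})
```

Because the weights are untrained, the printed distribution is close to uniform; training would adjust both matrices so that the held-out word ("sits" in the figure) receives the highest probability.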

Page 12: Automatic Search Event-Summary

Approach 3 – Word2vec + Clustering

• k-means

• DBSCAN

Page 13: Automatic Search Event-Summary

Approach 3 – Word2vec + TextRank

[Figure: Document Text (john deere compact utility tractor taylor messick Inc ... company profile agricultural equipment) -> Trained Word2vec Model (N*D matrix with rows W(1), W(2), W(3), W(4), ..., W(n-2), W(n-1), W(n)) -> related words (tractor, tillage, mower, excavator, sprayer, shredder, agriculture, harvest) -> TextRank -> keywords (mower, excavator, shredder, tillage, harvest, sprayer)]

• Identify semantically important keywords

Page 14: Automatic Search Event-Summary

Approach 4 – TextRank + Word2vec

TextRank Result

Word        TextRank Score
tractor     0.015847
john        0.013281
sale        0.012494
standard    0.012474
equipment   0.010799
power       0.009747
messick     0.008162
new         0.008151
work        0.007907
series      0.007707
mower       0.006099
utility     0.006035
compact     0.005751

Word2vec Similarity (to tractor)

mower       0.8502
excavator   0.7708
shredder    0.7451
tillage     0.7341
harvest     0.7154
sprayer     0.7101

Word        New Score
tractor     0.015847
mower       0.015847 * 0.8502 = 0.013473
john        0.013281
sale        0.012494
standard    0.012474
excavator   0.015847 * 0.7708 = 0.012215
shredder    0.015847 * 0.7451 = 0.011808
tillage     0.015847 * 0.7341 = 0.011633
harvest     0.015847 * 0.7154 = 0.011337
sprayer     0.015847 * 0.7101 = 0.011253
equipment   0.010799
power       0.009747
messick     0.008162
new         0.008151
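The rescoring on this slide can be sketched as follows. The numbers are taken from the slide; the function name `expand_with_similar` is hypothetical. One rule is an assumption the slide only implies: a word that already has a TextRank score (mower) keeps the larger of its old score and its similarity-weighted score.

```python
textrank_scores = {
    "tractor": 0.015847, "john": 0.013281, "sale": 0.012494, "standard": 0.012474,
    "equipment": 0.010799, "power": 0.009747, "messick": 0.008162, "new": 0.008151,
    "work": 0.007907, "series": 0.007707, "mower": 0.006099, "utility": 0.006035,
    "compact": 0.005751,
}
# Word2vec similarity of each candidate to the top TextRank keyword (tractor)
similar_to_top = {
    "mower": 0.8502, "excavator": 0.7708, "shredder": 0.7451,
    "tillage": 0.7341, "harvest": 0.7154, "sprayer": 0.7101,
}

def expand_with_similar(scores, similar_words):
    """Give each similar word the top keyword's score scaled by its similarity."""
    seed_score = max(scores.values())
    merged = dict(scores)
    for word, similarity in similar_words.items():
        candidate = seed_score * similarity
        # assumption: keep the larger score when the word was already ranked
        merged[word] = max(merged.get(word, 0.0), candidate)
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

ranked = expand_with_similar(textrank_scores, similar_to_top)
for word, score in ranked[:14]:
    print(f"{word:10s} {score:.6f}")
```

The effect is that words absent from the landing page (excavator, shredder, tillage, and so on) enter the ranking just below the keyword they are similar to.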

Page 15: Automatic Search Event-Summary

Google’s Pre-trained Word2vec

Campaign                                   Fraction of words in pre-trained model vocab.   Fraction of keywords in pre-trained model vocab.
Geico                                      0.929985                                        0.88888
Taylor Messick (Agricultural Equipment)    0.929784                                        0.41176
Trane (AC)                                 0.922018                                        0.71428
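A coverage figure of this kind can be computed as the share of tokens found in the model's vocabulary. The sketch below is an assumption about how the numbers were obtained: the slides do not say whether coverage counts tokens or unique words, and the example words and vocabulary are made up.

```python
def vocab_coverage(tokens, vocab):
    """Fraction of tokens that are present in the given vocabulary."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in vocab) / len(tokens)

# Hypothetical example: a tiny stand-in for a pre-trained model's vocabulary
pretrained_vocab = {"tractor", "mower", "equipment", "insurance"}
keywords = ["tractor", "mower", "messick"]
print(round(vocab_coverage(keywords, pretrained_vocab), 5))
```

Brand and dealer names such as "messick" are the kind of keyword a general-purpose pre-trained model tends to miss, which is consistent with the lower keyword coverage for the agricultural equipment campaign.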

Page 16: Automatic Search Event-Summary

Model Testing

1. Generate keywords from the 4 models
2. Feed them into Lucene and find URLs
3. Track the audience who visited these URLs
4. Compare the audience we find to the audience the pixels find
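Step 4 amounts to a set comparison between two audiences. The sketch below uses made-up user IDs and a hypothetical function name. It assumes the reported percentage is the overlap divided by the pixel audience; the slides do not state the denominator.

```python
def compare_audiences(keyword_audience, pixel_audience):
    """Return the overlap count and its share of the pixel audience."""
    keyword_audience = set(keyword_audience)
    pixel_audience = set(pixel_audience)
    overlap = keyword_audience & pixel_audience
    share = len(overlap) / len(pixel_audience) if pixel_audience else 0.0
    return len(overlap), share

# Hypothetical user IDs: users reached through keyword URLs vs. users seen by pixels
found_by_keywords = {"u1", "u2", "u3"}
found_by_pixels = {"u2", "u3", "u4", "u5"}
count, share = compare_audiences(found_by_keywords, found_by_pixels)
print(f"{count} ({share:.1%})")
```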

Page 17: Automatic Search Event-Summary

Results (Dell) - Keyword

TFIDF                  TextRank   Word2vec_Textrank   TextRank_Word2vec
office                 dell       outlet              dell
dellcom                support    collaboration       acquire
view                   service    acquire             laptop
electronics            product    work                desktop
customer               price      purchase            software
dell                   use        spare               rebate
representative         software   poster              welding
dellcomreturnspolicy   customer   transformation      windows
dells                  system     apg                 corporations

information practices new dell please dell software

prosupport dell dell inc poster laptop desktop

products view dell outlet apg transformation dell new

services support dell dell today purchase acquire dell tablet

dell sales dell team spare transformation dell inc

Page 18: Automatic Search Event-Summary

Results (Toyota) - Keyword

TFIDF           TextRank      Word2vec_Textrank   TextRank_Word2vec
highlander      toyota        generate            toyota
kbbcom          information   acquire             preowned
edmundscom      site          misuse              certified
certify         vehicle       tale                highlander
information     use           govern              rav
certification   program       tradein             yaris
site            email         fourwheel           avalon

program service generate tale corolla

assistance sale rubbed bologna sequoia

violated please toyota site identify tundra

hybrid highlander toyota vehicle wheel camry

car certification toyota dealer rubbed tale venza

personal information new toyota help toyota vehicle

cruiser preowned toyota certified new avalon preowned

Page 19: Automatic Search Event-Summary

Results - URLs

Dell

http://thetechjournal.com/electronics/laptop/dell-inspiron-15r-laptop.xhtml

http://www.adverts.ie/laptop-parts-and-accessories/dell-laptop-charger-19-5v-4-62a-90w/10838435

http://www.dellservicecentreinchennai.in/tablet-repair-center-medavakkam.html

http://www.dell.com/us/business/p/poweredge-c6320p/pd?oc=&model_id=poweredge-c6320p&l=en&s=bsd

http://forum.notebookreview.com/threads/dell-2012-outlet-coupons.636641/page-21

Toyota

http://www.macdonaldtoyota.ca/

http://www.stcharlestoyota.net

http://www.baldwintoyotaofpoplarbluff.com/

http://www.lafontainetoyota.com/

http://www.cedarrapidstoyota.com/

http://www.craigtoyota.com/

http://www.planettoyotaonline.com/

http://www.gatewaytoyotapierre.com/

Page 20: Automatic Search Event-Summary

Result - # of Converters

Page 21: Automatic Search Event-Summary

Result - % of Converters

CampaignId   TFIDF        TextRank     TextRank_Word2vec   Word2vec_TextRank
13405        25 (0.2%)    99 (0.8%)    44 (0.4%)           1 (0.008%)
13553        229 (3.2%)   269 (3.7%)   252 (3.5%)          8 (0.1%)
14099        6 (0.03%)    57 (0.3%)    16 (0.08%)          2 (0.01%)
14545        247 (3%)     250 (3%)     482 (5.7%)          7 (0.08%)
15077        0 (0%)       4 (0.02%)    15 (0.08%)          6 (0.03%)

Page 22: Automatic Search Event-Summary

Conclusion

• TextRank and TextRank_Word2vec consistently perform better than TFIDF

• TextRank doesn’t require extra space for saving a model

• All 3 models need O(n) computational time

Page 23: Automatic Search Event-Summary
Page 24: Automatic Search Event-Summary

Appendix

Page 25: Automatic Search Event-Summary

Neural Net Language Model

[Figure: the five context words "The", "cat", "sits", "on", "that" enter as one-hot vectors W(1), W(2), ..., W(t-1), W(t) (5 V*1). The Projection Matrix (D*V) maps them to a 5D*1 vector, which feeds a tanh Hidden Layer and then a softmax Output layer over the vocabulary. Output probabilities shown: Apple 0.003, ..., Computer 0.000, point 0.009, traffic 0.011, inbox 0.045, policy 0.000, print 0.000, couch 0.366, choice 0.010, ..., choose 0.010, later 0.000, media 0.000. The output layer is annotated "Most Computation". Other labels in the figure: Maximize, Time Complexity.]

Page 26: Automatic Search Event-Summary

Hierarchical Probabilistic Neural Net Language Model

[Figure: the five context words "The", "cat", "sits", "on", "that" enter as one-hot vectors W(1), W(2), ..., W(t-1), W(t) (5 V*1). The Projection Matrix (D*V) maps them to a 5D*1 vector, which feeds a tanh Hidden Layer. The output is a hierarchy over the words TV, Computer, couch, table, make, choose, print, write.]

Page 27: Automatic Search Event-Summary

Continuous Bag-of-Words Model

[Figure: the context words "The", "cat", "on", "that" enter as one-hot vectors W(1), W(2), ..., W(t-1) (each V*1). The Projection Matrix (D*V) maps them to a D*1 vector. The output for position W(t) is a hierarchy over the words TV, Computer, couch, table, make, choose, sits, crawl.]