1 Identifying Sets of Related Words from the World Wide Web Thesis Defense 06/09/2005 Pratheepan (Prath) Raveendranathan Advisor: Ted Pedersen


Identifying Sets of Related Words from the World Wide Web

Thesis Defense 06/09/2005

Pratheepan (Prath) Raveendranathan

Advisor: Ted Pedersen


Outline

• Introduction & Objective

• Methodology

• Experimental Results

• Conclusion

• Future Work

• Demo


Introduction

• The goal of my thesis research is to use the World Wide Web as a source of information to identify sets of words that are related in meaning.
– Example: given two words, {gun, pistol}, a possible set of related words would be {handgun, holster, shotgun, machine-gun, weapon, ammunition, bullet, magazine}
– Example: given three words, {toyota, nissan, ford}, a possible set of related words would be {honda, gmc, chevy, mitsubishi}


Examples Cont…

– Example: given two words, {red, yellow}, a possible set of related words would be {white, black, blue, colors, green}
– Example: given two words, {George Bush, Bill Clinton}, a possible set of related words would be {Ronald Reagan, Jimmy Carter, White House, Presidents, USA, etc.}


Application

• Use sets of related words to classify the Semantic Orientation of reviews (Peter Turney).
• Use sets of related words to find the sentiment associated with a particular product (Rajiv Vaidyanathan and Praveen Agarwal).


Pros and Cons of using the Web

• Pros
  – Huge amounts of text
  – Diverse text
    • Encyclopedias, Publications, Commercial Web Pages
  – Dynamic (ever-changing state)
• Cons
  – The Web creates a unique set of challenges
  – Dynamic (ever-changing state)
    • News websites, Blogs
  – Presence of repetitive, noisy, or low-quality data
    • HTML tags, web lingo (home page, information, etc.)


Contributions

• Developed an Algorithm that predicts sets of related words by using pattern matching techniques and frequency counts.

• Developed an Algorithm that predicts sets of related words by using a relatedness measure.

• Developed an Algorithm that predicts sets of related words by using a relatedness measure and an extension of the Log Likelihood score.

• Applied sets of related words to the problem of Sentiment Classification.


Outline

• Introduction & Objective

• Methodology

• Experimental Results

• Conclusion

• Future Work

• Demo


Interface to Web - Google

– Reasons for using Google

• Research is very much dependent on both the quantity and quality of the Web content.

• Google has a very effective ranking algorithm called PageRank which attempts to give more important or higher quality web pages a higher ranking.

• Google API – An interface which allows programmers to query more than 8 billion web pages using the Google search engine. (http://www.google.com/apis/).


Problems with Google API

• Restricted to 1000 queries a day
• 10 results for each query
• No “near” operator (proximity-based search)
• Maximum of 1000 results
• Alternative
  – Yahoo API – 5000 queries a day (released very recently)
    • No “near” operator either
    • Cannot retrieve number of hits

Note: Google was used only as a means of retrieving information from the Web.


Key Idea behind Algorithms

• Words that are related in meaning often tend to occur together.
– Example:

A Springfield, MA, Chevrolet, Ford, Honda, Lexus, Mazda, Nissan, Saturn, Toyota automotive dealer with new and pre-owned vehicle sales and leasing


Algorithm 1

• Features
• Based on frequency
• Takes only single words as input
• Initial set of 2 words
• Frequency cutoff
• Ranked by frequency
• Smart stop list
  – The, if, me, why, you, etc. (non-content words)
• Web stop list
  – Web page, WWW, home, page, personal, url, information, link, text, decoration, verdana, script, javascript


Algorithm 1 – High level Description

1. Create queries to Google based on the input terms.

2. Retrieve the top N web pages for each query.
   1. Parse the retrieved web page content for each query.

3. Tokenize the web page content into a list of words and frequencies.
   1. Discard words that occur fewer than C times.

4. Find the words common to at least two of the sets of words. This set of intersecting words is the set of words related to the input terms.

5. Repeat the process for I iterations, using the set of related words from the previous iteration as input.


Algorithm 1 Trace 1

• Search Terms : S1={pistol, gun}

• Frequency Cutoff – 15

• Num Results (Web Pages) – 10

• Iterations - 2


Algorithm 1 –Step 1

1. Create queries to Google based on permutations of the input terms:

– gun
– gun AND pistol
– pistol
– pistol AND gun
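The query construction in this step can be sketched as follows. This is a minimal illustration, assuming queries are plain strings and that pairs are joined with the literal keyword AND as on the slide; the function name `build_queries` is hypothetical and not taken from the Google-Hack package.

```python
from itertools import permutations


def build_queries(terms):
    """Return each term on its own plus every ordered pair joined by AND."""
    singles = list(terms)
    ordered_pairs = [f"{a} AND {b}" for a, b in permutations(terms, 2)]
    return singles + ordered_pairs


print(build_queries(["gun", "pistol"]))
# ['gun', 'pistol', 'gun AND pistol', 'pistol AND gun']
```

For n input terms this produces n single-term queries and n(n − 1) ordered pairs, n² in total, so the query count grows quickly as the input set grows.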


Algorithm 1 – Step 2

2. Issue the query to Google.
   1. Retrieve the top 10 URLs for the query.
      1. For each URL, retrieve the web page content, and parse the web page for more links.
      2. Traverse these links and retrieve the content of those web pages as well.

Repeat this process for each query.


Trace 1 Cont…

• Web pages for the query gun

– http://www.thesmokinggun.com/
– http://www.gunbroker.com/
– http://www.gunowners.org/
– http://www.ithacagun.com/
– http://www.doublegun.com/
– http://www.imdb.com/title/tt0092099/
– http://www.imdb.com/Title?0092099
– http://www.gunandgame.com/
– http://www.gunaccessories.com/
– http://www.guncite.com/


Trace 1 Cont…

• Web pages for pistol

– http://www.idpa.com/
– http://www.bullseyepistol.com/
– http://www.crpa.org/
– http://www.zvis.com/dep/dep.shtml
– http://www.nysrpa.org/
– http://www.auspistol.com.au/
– http://hubblesite.org/newscenter/newsdesk/archive/releases/1997/33/
– http://en.wikipedia.org/wiki/Pistol
– http://www.imdb.com/title/tt0285906/
– http://www.fas.org/man/dod-101/sys/land/m9.htm


Trace 1 Cont…

• Web pages for gun AND pistol

– http://www.usgalco.com/
– http://www.minirifle.co.uk/
– http://www.dypic.com/gunsafepistol.html
– http://www.datacity.com/handgun-pistol-case.html
– http://www.camping-hunting.com/
– http://www.pelican-case.com/pelguncaspis.html
– http://www.cafepress.com/4funnystuff/566642
– http://www.nimmocustomarms.com/
– http://www.bullseyegunaccessories.com/
– http://www.airsoftshogun.com/P_224.htm


Trace 1 Cont…

• Web pages for pistol AND gun

– http://www.safetysafeguards.com/
– http://www.safetysafeguards.com/site/402168/page/57955
– http://www.safetysafeguards.com/site/402168/page/57959
– http://www.airguns-online.co.uk/
– http://www.dypic.com/gunsafepistol.html
– http://www.airgundepot.com/eaa-drozd.html
– http://www.docs.state.ny.us/DOCSOlympics/Combat.htm
– http://www.datacity.com/handgun-pistol-case.html
– http://www.sail.qc.ca/catalog/detail.jsp?id=2880
– http://portfolio-pro.com/pistolhandgun.html


Algorithm 1 – Step 3

3. Next, for the total web page content retrieved for each query,
   1. Remove HTML tags etc. and retrieve the text.
   2. Remove stop words.
   3. Tokenize the web page content into lists of words and frequencies.

Note: This results in the following 4 sets of words, each set representing the words retrieved for one query.
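The text extraction, stop-word removal, and frequency counting in this step could look roughly like the sketch below. It is an assumption-laden illustration using only the Python standard library: the stop lists are small subsets of the examples given on the Algorithm 1 slide, the tokenizer keeps runs of ASCII letters only, and the names `word_frequencies` and `_TextExtractor` are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Small illustrative subsets of the smart and web stop lists (assumption).
STOP_WORDS = {"the", "if", "me", "why", "you"}
WEB_STOP_WORDS = {"www", "home", "page", "url", "link", "script", "javascript"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def word_frequencies(html, cutoff):
    """Map each non-stop word to its count, keeping counts >= cutoff."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    tokens = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    counts = Counter(
        t for t in tokens if t not in STOP_WORDS and t not in WEB_STOP_WORDS
    )
    return {word: n for word, n in counts.items() if n >= cutoff}
```

Calling `word_frequencies(page_html, 15)` on the concatenated content for one query would give one of the four word sets, with the frequency cutoff of 15 used in this trace.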


Words from Web pages after removing stop words

gun pistol shotgun, 15 shooting, 25 hobbies, 18 chelmsford, 15 electronic, 24 hand, 66 mounts, 21 dep, 16 rifle, 120 pelican, 56 option, 60 normal, 48 daily, 33 eagle, 20 pistols, 35 auto, 18 biometric, 20 technical, 16 holsters, 27 desert, 19 practical, 69 club, 56 hspace, 24 imgcounter, 27 parts, 15 crpa, 17 shotgun, 79 holster, 15 menus, 15 security, 20 systems, 24 trigger, 24 foam, 27 ddd, 21 small, 17 control, 31 ipsc, 18 cases, 56 guns, 38 members, 19 cases, 33 case, 82 shooting, 123 middle, 740 catalog, 371 bullets, 17 essex, 30 target, 22 cases, 35 category, 370 reloading, 16 hobby, 18 bullets, 22 shoes, 16 order, 17 military, 19 ruger, 38 airsoft, 28 safes, 62 auto, 20 rifle, 21 ukpsa, 22 sport, 28 airsoft, 50 addtab, 30 care, 20 clubs, 19 safe, 29 vspace, 18 paintball, 20 grips, 31 semi, 18 range, 19 soft, 22 pro, 36 knives, 44 guns, 72 mini, 25 null, 1051 safety, 53 tactical, 24 bullet, 42 shoot, 31 travel, 15 boots, 24 stocks, 23 forum, 18 diversion, 21 false, 30 optics, 29 advertise, 16 air, 70 safe, 70 shooting, 19 pictures, 17 rifle, 29 money, 15 scope, 16 dealers, 17 family, 59 uploaded, 17 accessories, 53 riffles, 22 shopping, 16 fingerprint, 27

firearms, 22 case, 37 accessories, 59 ammo, 23 silver, 17

gun AND pistol pistol AND gun


Algorithm 1 – Step 4

4. Find the words that are common to at least 2 sets.

Let
A. gun AND pistol
B. pistol AND gun
C. gun
D. pistol

Related Set = (A ∩ B) ∪ (A ∩ C) ∪ (A ∩ D) ∪ (B ∩ C) ∪ (B ∩ D) ∪ (C ∩ D)
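Selecting the words that occur in at least two of the per-query sets can be sketched as below. The function name `related_set` is hypothetical, and ranking by the frequency summed over all sets is an assumption; the slides only say that Algorithm 1 ranks by frequency.

```python
from collections import Counter


def related_set(word_counts_per_query):
    """Keep words seen in >= 2 query sets, ranked by summed frequency.

    word_counts_per_query: list of dicts mapping word -> frequency,
    one dict per query (A, B, C, D in the slide's notation).
    """
    sets_containing = Counter()
    total_frequency = Counter()
    for counts in word_counts_per_query:
        for word, freq in counts.items():
            sets_containing[word] += 1
            total_frequency[word] += freq
    common = [(w, total_frequency[w]) for w in total_frequency
              if sets_containing[w] >= 2]
    # Highest summed frequency first; ties broken alphabetically.
    return sorted(common, key=lambda item: (-item[1], item[0]))
```

Counting set membership once per word is equivalent to taking the union of all pairwise intersections, without enumerating the pairs.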


Related Set 1 – Iteration 1

Result Set 1
rifle, 177
shooting, 169
case, 127
accessories, 126
cases, 124
guns, 123
safe, 100
shotgun, 97
airsoft, 78
auto, 41
bullets, 40


Trace 1 Cont… Iteration 2

• 11 input terms
– Search terms created:
  • Rifle
  • Shooting
  • Guns
  • Cases
  • Airsoft
  • Shooting AND Guns
  • Guns AND Shooting
  • Guns AND Cases
  etc.

Results in 11² = 121 queries to Google!

Note: As you can see, the number of queries to Google increases drastically.


Result Set 2 – {gun, pistol}

pistols, 227        holster, 118      tac, 79
firearms, 205       fits, 118         radio, 77
accessories, 204    shoot, 117        paintball, 75
free, 192           sport, 115        assault, 71
holsters, 172       hours, 109        teflon, 70
club, 170           usa, 109          pouch, 69
target, 164         ammo, 107         number, 69
tactical, 161       electric, 107     shoulder, 69
air, 158            ships, 106        leg, 64
practical, 152      spring, 103       core, 62
range, 150          articles, 96      essex, 60
court, 149          carry, 95         nylon, 57
uk, 147             ruger, 93         flash, 55
sports, 145         force, 92         bullets, 53
law, 143            mp, 90            trigger, 50
price, 142          remote, 90        straps, 46
full, 140           car, 89           helicopter, 45
control, 140        harlow, 88        riffles, 44
soft, 124           magazines, 87     coat, 44
military, 121       belt, 86          ukpsa, 44
custom, 120         mini, 82


Algorithm 1 – {red, yellow}

Number of Results – 10
Frequency Cutoff – 15
Iterations – 1

Related Words
enterprise, 411
software, 257
solutions, 151
management, 142
technology, 141
system, 96
services, 89
netherlands, 84
fellow, 76
applications, 71
snake, 70
performance, 64
scarlet, 62
project, 34
organizations, 33
organization, 29
coral, 28
black, 28
blue, 27


Problems with Algorithm 1

• Frequency based ranking,

• Number of input terms restricted to 2,

• Input and output restricted to single words


Algorithm 2

• Features
• Based on frequency & relatedness score
• Can take single words or 2-word collocations as input
• Relatedness measure based on Jiang and Conrath
• Frequency cutoff and relatedness score cutoff
• Ranked by score
• Initial set can be more than 2 words
• Bi-grams as output
• Smart stop list
  – The, if, me, why, you, etc.
• Web stop words + phrases
  – Web page, WWW, home page, personal, url, information, link, text, decoration, verdana, script, javascript


Algorithm 2 – High level Description

1. Repeat the same steps as in Algorithm 1 to retrieve an initial set of related words (add bigrams to the results as well).

2. For each word returned by Algorithm 1 as a related word,
   1. Calculate the relatedness of the word to the input terms.
   2. Discard any word or bigram with a relatedness score greater than the score cutoff.
   3. Sort the remaining terms from most relevant to least relevant.

3. Repeat Steps 1 – 2 for each iteration, using the set of words from the previous iteration as input.
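The discard-and-sort part of step 2 is simple enough to sketch directly. The function name `filter_and_rank` is hypothetical. The example scores are ones reported elsewhere in this deck for {artificial intelligence, machine learning}; applying a cutoff of 30 to them here is purely for illustration.

```python
def filter_and_rank(scores, score_cutoff):
    """Drop terms scoring above the cutoff; sort the rest ascending.

    Lower scores mean closer relatedness, so the most relevant
    terms come first.
    """
    kept = [(term, s) for term, s in scores.items() if s <= score_cutoff]
    return sorted(kept, key=lambda item: item[1])


scores = {
    "reinforcement learning": 31.17,
    "case based": 29.83,
    "neural networks": 20.88,
    "computer science": 30.21,
}
print(filter_and_rank(scores, 30))
# [('neural networks', 20.88), ('case based', 29.83)]
```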


Relatedness Measure (Distance Measure)

• Relatedness (Word1, Word2) = log (hits(Word1)) + log (hits(Word2)) – 2 * log (hits(Word1 Word2))
  (Based on measure by Jiang and Conrath)

• Example 1,
  hits(toyota) = 12,500,000
  hits(ford) = 22,900,000
  hits(toyota AND ford) = 50,000
  Relatedness (toyota, ford) = 32.41

• Example 2,
  hits(toyota) = 12,500,000
  hits(ford) = 22,900,000
  hits(toyota AND ford) = 150,000
  Relatedness (toyota, ford) = 30.82


Relatedness Measure Cont…

• Example 3,
  hits(toyota) = 1000
  hits(ford) = 1000
  hits(toyota AND ford) = 1000
  Relatedness (toyota, ford) = 0

As the measure approaches zero, the relatedness between the two terms increases.
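The measure can be sketched in a few lines, with some caveats. The slides do not state the logarithm base; base 2 is assumed here. With base 2, the zero in Example 3 follows from the formula as written (joint term weighted by 2), whereas the quoted values 32.41 and 30.82 are what the arithmetic gives when the joint term is subtracted only once. The sketch therefore exposes that weight as a parameter rather than picking one reading. The function name and parameter names are hypothetical.

```python
import math


def relatedness(hits_w1, hits_w2, hits_both, joint_weight=2.0, base=2.0):
    """log(hits(w1)) + log(hits(w2)) - joint_weight * log(hits(w1 AND w2)).

    Lower values indicate more closely related terms.
    """
    if min(hits_w1, hits_w2, hits_both) <= 0:
        raise ValueError("hit counts must be positive")
    return (math.log(hits_w1, base) + math.log(hits_w2, base)
            - joint_weight * math.log(hits_both, base))


# Example 3: all three counts equal, formula as written.
print(relatedness(1000, 1000, 1000))  # 0.0

# Examples 1 and 2 with the joint term subtracted once.
print(round(relatedness(12_500_000, 22_900_000, 50_000, joint_weight=1.0), 2))
print(round(relatedness(12_500_000, 22_900_000, 150_000, joint_weight=1.0), 2))
```

Under either weighting, a larger joint hit count lowers the score, which matches the slide's reading that values closer to zero mean stronger relatedness.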


Input Set – {gun, pistol}

Algorithm 1           Algorithm 2
shooting, 169         shotgun, 16.40
guns, 124             rifle, 18.01
rifle, 113            holster, 19.31
case, 81              ammo, 19.61
accessories, 74       shooting, 22.21
cases, 74             bullets, 22.80
airsoft, 72           air, 24.88
products, 68          holsters, 25.04
bullet, 53            airsoft, 25.79
air, 50               gun cases, 26.02
shotgun, 46           accessories, 26.99
holsters, 46          guns, 28.42
ammo, 37              equipment, 29.32
bullets, 34           remington, 29.37


Algorithm 2 – {red, yellow}

Number of Results – 10
Frequency Cutoff – 10
Score Cutoff – 30
Iterations – 1

blue, 16.77
black, 17.07
scarlet, 24.91
coral, 28.97


Problems with Algorithm 2

• Certain bigrams are not good collocations.
– For example, {sunny, cloudy}

Number of Results – 10
Frequency Cutoff – 15
Bigram Cutoff – 4
Score Cutoff – 30

clear, 24.35
partly cloudy, 25.85
forecast text, 26.66
partly sunny, 26.92
light, 27.33
bulletin fpcn, 28.33
wind, 28.84
winds, 29.22


Algorithm 3 – High Level Description

1. Repeat the same steps as in Algorithm 1 to retrieve an initial set of related words (add bigrams to the results as well).

2. For each term returned by Algorithm 1 as a related word,
   1. If the term is a bigram, check whether it is a valid collocation.
      1. If the bigram is a valid collocation, continue with step 2.2.
      2. Otherwise, remove the term from the set of related words.
   2. Calculate the relatedness of the term to the input terms.
   3. Discard any word or collocation with a relatedness score greater than the score cutoff.
   4. Sort the remaining terms from most relevant to least relevant.


Verifying Bigrams

• Adapt the Log Likelihood (G2) score to web hit counts
– Example, “New York”
– 4 queries to Google:
  “* York”
  “of the”
  “New *”
  “New York”

             York    Not York    Total
New           607        2953     3560
Not New        14        2096     2110
Total         621        5049     5670


Expected Values

             York                                   Not York
New          389.9047619 = (621 * 3560) / 5670      3170.0952 = (5049 * 3560) / 5670
Not New      231.095238 = (621 * 2110) / 5670       1878.9048 = (5049 * 2110) / 5670


Identifying a “bad” collocation

• A bigram is discarded if
– the observed value for the bigram is 0 (e.g., “New York”)
– the observed value for the bigram is less than the expected value.
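The expected values and the discard rule can be reproduced with a short sketch. It takes the four cells of the 2×2 table directly; how the four Google hit counts are mapped onto those cells is not spelled out on the slides and is left outside the sketch. The G² formula used is the standard one, 2 · Σ O · ln(O / E); returning 0 for a discarded bigram is an assumption. Function names are hypothetical.

```python
import math


def expected_counts(n11, n12, n21, n22):
    """Expected cell values (row total * column total / grand total)."""
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    total = row1 + row2
    return (row1 * col1 / total, row1 * col2 / total,
            row2 * col1 / total, row2 * col2 / total)


def log_likelihood(n11, n12, n21, n22):
    """G2 score for the bigram counted in cell n11, or 0.0 if discarded."""
    observed = (n11, n12, n21, n22)
    expected = expected_counts(*observed)
    # Discard rule from the slide: observed is 0 or below expectation.
    if n11 == 0 or n11 < expected[0]:
        return 0.0
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


# "New York" table: New/York, New/Not York, Not New/York, Not New/Not York
print(expected_counts(607, 2953, 14, 2096))
```

For the “New York” table the observed count 607 exceeds the expected 389.90, so the bigram would be kept and given a positive score.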


Example Bigrams


Methodology

• Introduction & Objective

• Methodology

• Experimental Results & Evaluation

• Conclusion

• Future Work

• Demo


Evaluating Results

• Compare with Google Sets
– http://labs.google.com/sets

• Human Subject Experiments
– Around 20 people expanded 2-word sets into what they felt was a set of related words


F-measure, Precision and Recall


Comparison of Algorithm 1 & 2

{toyota, ford}, Frequency Cutoff – 5:
truck, 66
car, 61
sales, 59
parts, 46
vehicles, 45
year, 43
cars, 35
auto, 32
motors, 30
general, 27
company, 24
honda, 20
service, 20
automotive, 18
nissan, 18
trucks, 17
consumer, 17
detroit, 13
marketing, 13
volvo, 12
media, 12
buyers, 12
focus, 11

{toyota, ford}, Frequency Cutoff – 5, Score Cutoff – 30:
gm, 19.09
nissan, 20.15
car, 29.77

{toyota, ford, nissan}, Frequency Cutoff – 5, Score Cutoff – 30:
mazda, 19.59
honda, 19.92
chevrolet, 21.37
bmw, 22.47
dodge, 22.83
lexus, 23.05
mitsubishi, 23.17
pontiac, 23.89
mercedes, 24.56
gmc, 25.14
vehicles, 27.77


Algorithm 1

{jordan, chicago}

Number of Results – 10
Frequency Cutoff – 15
Iterations – 1

Google Hack:
michael, 174
bulls, 148
nba, 97
game, 56
jersey, 43

Google Sets:
Chicago
Jordan
Israel
JOHNSON
Jackson
Kuwait
JANESVILLE
Iraq
Japan
Lebanon
Egypt
Springfield

Precision = 0, Recall = 0

F-measure = 0


Algorithm 2

{toyota, ford, nissan}

Number of Results – 10
Frequency Cutoff – 10
Score Cutoff – 30
Iterations – 1

Google Hack:
mazda, 19.59
honda, 19.92
chevrolet, 21.37
bmw, 22.47
dodge, 22.83
lexus, 23.05
mitsubishi, 23.17
pontiac, 23.89
mercedes, 24.56
gmc, 25.14
vehicles, 27.77

Google Sets:
HONDA
MAZDA
SUBARU
MITSUBISHI
DODGE
CHEVROLET
Jeep
Volvo
Buick
Pontiac
Suzuki

Human Subject:
benz
buick
subaru
mitsubishi
dodge
chevrolet
jeep
volvo
buick
pontiac
suzuki
holden
mitsubishi

Precision = 6/11 = 0.54, Recall = 6/11 = 0.54

F-measure = 0.54
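The evaluation numbers for {toyota, ford, nissan} against Google Sets can be recomputed with the sketch below. It assumes the standard definitions (precision = overlap / predicted, recall = overlap / gold, F-measure = their harmonic mean) and case-insensitive matching; the balanced F-measure is an assumption, though it agrees with the other precision, recall, and F-measure triples quoted in this deck. The function name is hypothetical.

```python
def precision_recall_f(predicted, gold):
    """Set-based precision, recall and balanced F-measure (case-insensitive)."""
    predicted = {p.lower() for p in predicted}
    gold = {g.lower() for g in gold}
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)


google_hack = ["mazda", "honda", "chevrolet", "bmw", "dodge", "lexus",
               "mitsubishi", "pontiac", "mercedes", "gmc", "vehicles"]
google_sets = ["HONDA", "MAZDA", "SUBARU", "MITSUBISHI", "DODGE", "CHEVROLET",
               "Jeep", "Volvo", "Buick", "Pontiac", "Suzuki"]

print(precision_recall_f(google_hack, google_sets))
```

Six of the eleven predicted terms appear in the Google Sets list, giving 6/11 for precision, recall, and F-measure alike.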


Algorithm 2

{january, february, may}

Number of Results – 10
Frequency Cutoff – 10
Score Cutoff – 30
Iterations – 1

Google Hack:
june, 22.90
july, 24.39
august, 25.33
september, 25.50
march, 25.71
october, 26.21
november, 27.09
april, 27.49
december, 27.61

Google Sets:
March
April
June
October
November
December
September
July
August

Precision = 9/9 = 1, Recall = 9/9 = 1

F-measure = 1


Algorithm 2

Google Hack:
prada, 18.17
moschino, 18.45
gucci, 18.60
dkny, 19.00
valentino, 19.72
chanel, 19.93
gianni, 20.12
hugo boss, 20.17
calvin klein, 20.29
gianni versace, 20.46
dolce gabbana, 21.76
calvin, 21.97
yves saint, 22.10
dior, 22.37
yves, 22.62
giorgio armani, 23.04
hugo, 23.06
fendi, 24.12
giorgio, 24.64
christian dior, 24.86

Google Sets:
Gucci
Chanel
Calvin Klein
Prada
Dolce Gabbana
Fendi
Hugo Boss
Christian Dior
Hermes
Moschino
Donna Karan
Ralph Lauren
Valentino
Louis Vuitton
Giorgio Armani
DKNY
Escada
Tommy Hilfiger
Tiffany
Givenchy

{armani, versace}

Precision = 11/20 = 0.55,

Recall = 11/43 = .25

F-measure = 0.35

Not Entire Set

Number of Results – 10

Frequency Cutoff - 10

Bigram Cutoff - 4

Score Cutoff - 30

Iterations - 1


Algorithm 2

{artificial intelligence, machine learning}

Google Hack:
neural networks, 20.88
robotics, 21.14
neural, 21.60
data mining, 22.84
expert systems, 22.90
expert, 24.24
genetic algorithms, 24.30
reasoning, 24.40
logic programming, 24.40
natural language, 24.87
intelligent, 25.68
knowledge, 25.89
logic, 26.18
data, 26.21
natural, 26.23
genetic, 26.33
applications, 26.60
computer, 27.91
knowledge discovery, 28.91
ai, 29.16
case based, 29.83
computer science, 30.21
reinforcement learning, 31.17

Google Sets:
Neural Networks
Robotics
Knowledge Representation
Natural Language Processing
Pattern Recognition
Machine Vision
Programming Languages
Data Mining
Genetic Programming
Vision
Natural Language
Intelligent Agents
People
Publications
Philosophy
Qualitative Physics
Speech Processing
Expert Systems
Genetic Algorithms
Computer Vision
Computational Linguistics
Cognitive Science
Logic Programming

Precision = 9/23 = 0.39,

Recall = 9/48 = 0.1875

F-measure = 0.25

Number of Results – 10

Frequency Cutoff - 10

Bigram Cutoff - 4

Score Cutoff - 32

Iterations - 1


Comparison of Algorithm 2 & 3

{sunny, cloudy}

Algorithm 2:
clear, 24.35
partly cloudy, 25.85
forecast text, 26.66
partly sunny, 26.92
light, 27.33
bulletin fpcn, 28.33
wind, 28.84
winds, 29.22

Algorithm 3:
clear, 24.35
partly cloudy, 25.85
partly sunny, 26.92
light, 27.33
wind, 28.84
winds, 29.22

Number of Results – 10

Frequency Cutoff - 10

Bigram Cutoff - 4

Score Cutoff - 30

Iterations - 1


Algorithm 3 - Bigrams

Bigram                   Observed Value    Expected Value    Log Likelihood Score
neural networks                  617000         144620.81               954551.64
morgan kaufmann                  138000           5067.61               692428.92
pattern recognition              419000         248456.79               102193.35
genetic algorithms               129000          75014.81                32818.00
grammatical inference              4590           1474.92                 4214.81
based learning                   861000        13804761.9                       0
computer science               12700000       27947089.94                       0
ai magazine                       99500         340178.13                       0
ai programming                     8050         197317.46                       0
based reasoning                   46300         676825.39                       0
case based                       150000       12690476.19                       0
data mining                     1160000        1424162.25                       0
expert systems                   165000        3705114.63                       0
intelligence machine               3650         587160.49                       0

{artificial intelligence, machine learning}


Performance of Algorithms

             Algorithm 1    Algorithm 2    Algorithm 3
F-measure           0.06           0.26           0.29

• F-measure increases, from Algorithm 1 to 3


Sentiment Classification

• Pointwise Mutual Information – Information Retrieval algorithm (PMI-IR) – Peter Turney
– Used to classify reviews as being positive or negative in orientation

• Part-of-speech tag the review

• Extract 2-word phrases from the text
– Adjective followed by a Noun
– Noun followed by a Noun etc.

• Use a positive connotation such as “excellent” and a negative connotation such as “poor”, and calculate the Semantic Orientation (SO) for each 2-word phrase.


Example,

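• Let the phrase be “incredible cast”

\[
\mathrm{SO}(\text{``incredible cast''}) = \log_2 \frac{\mathrm{hits}(\text{``incredible cast'' NEAR ``excellent''}) \cdot \mathrm{hits}(\text{``poor''})}{\mathrm{hits}(\text{``incredible cast'' NEAR ``poor''}) \cdot \mathrm{hits}(\text{``excellent''})}
\]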
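The Semantic Orientation formula translates directly into code once the four hit counts are available. The sketch below takes the counts as plain numbers rather than querying a search engine. The small `smoothing` constant added to the two co-occurrence counts is an assumption made here to avoid division by zero and log of zero; it is not stated on the slide. The function and parameter names are hypothetical.

```python
import math


def semantic_orientation(hits_near_pos, hits_near_neg, hits_pos, hits_neg,
                         smoothing=0.01):
    """log2 of (hits(phrase NEAR pos) * hits(neg)) / (hits(phrase NEAR neg) * hits(pos)).

    hits_pos and hits_neg are the standalone counts for the positive and
    negative connotation words and must be positive.
    """
    numerator = (hits_near_pos + smoothing) * hits_neg
    denominator = (hits_near_neg + smoothing) * hits_pos
    return math.log2(numerator / denominator)


# Phrase co-occurs twice as often with the positive word: SO = 1.0
print(semantic_orientation(200, 100, 1000, 1000, smoothing=0))  # 1.0
```

A positive value means the phrase leans toward the positive connotation word, a negative value toward the negative one.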


Problem with Current Algorithm

• Words such as “poor” have at least two senses
– “poor” as in poverty
– “poor” as in not good


Extended PMI-IR

• Used Google instead of AltaVista

• Used AND instead of NEAR

• Extended SO formula– Use multiple pairs of positive and negative connotations

• {excellent, poor}, {good, bad}, {great, mediocre}
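One way the extension could be realised is sketched below. The slides do not say how the per-pair scores are combined, so summing them is an assumption, as is the smoothing constant. `hits` stands for any callable that returns a hit count for a query string; it is a hypothetical stand-in, not an actual search-engine API call.

```python
import math

CONNOTATION_PAIRS = [("excellent", "poor"), ("good", "bad"), ("great", "mediocre")]


def extended_so(phrase, hits, pairs=CONNOTATION_PAIRS, smoothing=0.01):
    """Sum the SO score of `phrase` over several (positive, negative) pairs.

    Co-occurrence is queried with AND rather than NEAR.
    """
    total = 0.0
    for positive, negative in pairs:
        numerator = (hits(f'"{phrase}" AND {positive}') + smoothing) * hits(negative)
        denominator = (hits(f'"{phrase}" AND {negative}') + smoothing) * hits(positive)
        total += math.log2(numerator / denominator)
    return total
```

Using several pairs reduces the dependence on a single ambiguous word such as “poor”, since a misleading sense in one pair can be outweighed by the others.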


A Negative Review for the movie “Planet of the Apes”

Classified by our Algorithm as being Negative


Positive Review for an Audi

Classified by our Algorithm as being Positive


Negative Movie Review

Classified by our Algorithm as being Negative


Performance of Extended PMI-IR

• Algorithm run on 20 reviews (movies and automobiles)

• Overall Accuracy – 75%

                    Classified as Positive    Classified as Negative    Total
Positive Reviews                         5                         5       10
Negative Reviews                         0                        10       10
Total                                    5                        15       20
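The overall accuracy follows from the diagonal of this table. A minimal sketch, with the table held in a nested dictionary keyed by true label and then predicted label (a representation chosen here for illustration):

```python
def accuracy(confusion):
    """Fraction of reviews whose predicted label equals the true label."""
    correct = sum(confusion[label][label] for label in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


confusion = {
    "positive": {"positive": 5, "negative": 5},
    "negative": {"positive": 0, "negative": 10},
}
print(accuracy(confusion))  # 0.75
```

All ten negative reviews are classified correctly, while half of the positive reviews are labelled negative, so the errors are entirely on the positive side.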


End Result:

• All of this is freely available on CPAN and Sourceforge

Google-Hack


Conclusions & Contribution

• Developed 3 Algorithms that try to predict sets of related words
– Algorithm 1 was based on frequency
– Algorithm 2 was based on a relatedness measure
– Algorithm 3 was based on a relatedness measure and the Log Likelihood score

• Applied sets of related words to Sentiment Classification


Conclusions & Contribution

• Released the free Perl package Google-Hack on CPAN and Sourceforge.

• Developed a web interface.


Future Work

• Addition of proximity operator

• Restrict # of web pages traversed

• Find intersection of words through different search engines - Yahoo API

• Use anchor text


Related URLs

• Research Page
– http://www.d.umn.edu/~rave0029/research

• Google-Hack
– http://google-hack.sf.net

• CPAN Release
– http://search.cpan.org/~prath/WebService-GoogleHack0.15/GoogleHack/GoogleHack.pm

• Web Interface
– http://marimba.d.umn.edu/cgi-bin/googlehack/index.cgi