Search Technologies

Search Technologies. Examples Fast Google Enterprise – Google Search Solutions for business – Page Rank Lucene – Apache Lucene is a high-performance,



Examples

• Fast Google Enterprise
– Google Search Solutions for business
– Page Rank
• Lucene
– Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java
• Solr
– Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.


Search Engine Ranking Criteria


Yahoo!

• has been in the search game for many years
• is better than MSN but nowhere near as good as Google at determining if a link is a natural citation or not

• has a ton of internal content and a paid inclusion program, both of which give them incentive to bias search results toward commercial results

• things like cheesy off topic reciprocal links still work great in Yahoo!


MSN (bing)

• new to the search game
• is bad at determining if a link is natural or artificial in nature
• due to sucking at link analysis they place too much weight on the page content
• their poor relevancy algorithms cause a heavy bias toward commercial results
• likes bursty recent links
• new sites that are generally untrusted in other systems can rank quickly in MSN Search
• things like cheesy off topic reciprocal links still work great in MSN Search


Google
• has been in the search game a long time, and saw the web graph when it was much cleaner than the current web graph
• is much better than the other engines at determining if a link is a true editorial citation or an artificial link
• looks for natural link growth over time
• heavily biases search results toward informational resources
• trusts old sites way too much
• a page on a site or subdomain of a site with significant age or link related trust can rank much better than it should, even with no external citations
• has aggressive duplicate content filters that filter out many pages with similar content
• if a page is obviously focused on a term they may filter the document out for that term. On-page variation and link anchor text variation are important. A page with a single reference or a few references of a modifier will frequently outrank pages that are heavily focused on a search phrase containing that modifier

• crawl depth determined not only by link quantity, but also link quality. Excessive low quality links may make your site less likely to be crawled deep or even included in the index.

• things like cheesy off topic reciprocal links are generally ineffective in Google when you consider the associated opportunity cost


Ask

• looks at topical communities
• due to their heavy emphasis on topical communities they are slow to rank sites until they are heavily cited from within their topical community

• due to their limited market share they probably are not worth paying much attention to unless you are in a vertical where they have a strong brand that drives significant search traffic


History

• SMART (System for the Mechanical Analysis and Retrieval of Text, informally “Salton’s Magic Automatic Retriever of Text”)
– Vector Space Model
– Relevance feedback algorithm (customization)
– Latent Semantic Indexing (LSI)


Basic Vector Space Algo

• Vanilla Search Algo
• Key word search (ignore search modifiers and stop words, e.g. not, and, this, their, is, or, of)
• Remove punctuation marks
• Reduce words to their root form (stemming)
– Combination of suffix and prefix removal
– E.g.: students → student, swam → swim
– Lemmatization
– Stochastic algorithm
– science, scientist??
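The steps on this slide can be sketched roughly as below. This is a minimal illustration, not the method of any particular engine: the stop-word list is an assumption (the slide names only a few examples), and `SUFFIX_RULES` is a hypothetical, deliberately crude suffix stripper. It handles students → student but not irregular forms such as swam → swim, which need lemmatization, and it mangles words like "physics".

```python
import string

# Assumed stop-word list: the slide gives only a few examples, the rest are illustrative.
STOP_WORDS = {"not", "and", "this", "their", "is", "or", "of", "the", "a", "to", "for"}

# Hypothetical suffix rules, tried in order. Real stemmers are far more careful.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def preprocess(text):
    """Lowercase, drop ASCII punctuation, remove stop words, then stem."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [stem(word) for word in cleaned.split() if word not in STOP_WORDS]


print(preprocess("The promise of search technologies"))
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes would need separate handling.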


Documents to be indexed

• Document 1
– Search technologies have been around for over forty years. Over this time, their user base expanded first from scientists and technologists to information professionals, and finally from information professionals to pretty much everyone.


• Document 2
– Math and Physics students are familiar with the challenge of finding the unambiguous “right answer”. The same is not true for information retrieval. Finding the “right document” may be as much art as science.


• Document 3
– Many serial killers do not suffer from psychosis and appear to be quite normal. Search for such killers can take years, even with the latest police technologies, and the results are often shocking.


Stop words for removal

• Search technologies have been around for over forty years. Over this time, their user base expanded first from scientists and technologists to information professionals, and finally from information professionals to pretty much everyone.

• Math and Physics students are familiar with the challenge of finding the unambiguous “right answer”. The same is not true for information retrieval. Finding the “right document” may be as much art as science.

• Many serial killers do not suffer from psychosis and appear to be quite normal. Search for such killers can take years, even with the latest police technologies, and the results are often shocking.


Stemming Changes Identified

• search technology around forty years time user base expanded first science technology information professionals finally information professionals pretty much everyone

• math physics students familiar challenge finding unambiguous right answer information retrieval finding right document much art science

• many serial killers suffer psychosis appear normal search killers take years latest police technology results shocking


Unique words identified

• Search[1] technology[2] around[3] forty[4] year[5] time[6] user[7] base[8] expand[9] first[10] science[11] technology[2] information[12] professional[13] final[14] information[12] professional[13] pretty[15] much[16] everyone[17]

• math[18] physics[19] student[20] familiar[21] challenge[22] find[23] unambiguous[24] right[25] answer[26] information[12] retrieval[27] find[23] right[25] document[28] much[16] art[29] science[11]

• many[30] serial[31] killer[32] psychosis[33] appear[34] normal[35] search[1] killer[32] take[36] year[5] latest[37] police[38] technology[2] result[39] shock[40]


Search Dictionary

[1] search [2] technology [3] around [4] forty [5] year [6] time………[40] shock


Representing documents as 40-dimensional vectors

• Values are in form of <dictionary ref>:<no of occurrences>

• Doc1(1:1, 2:2, 3:1, 4:1, 5:1, 6:1, 7:1,….,13:2,14:1, 15:1,…, 17:1, 18:0, 19:0,…,40:0)

• Doc2(1:0, 2:0, 3:0,…,11:1,12:1,…,16:1,17:0,18:1, 19:1, 20:1,..,29:1,30:0,31:0,….,40:0)

• Doc3(1:1, 2:1, 3:0, 4:0, 5:1, 6:0, 7:0, 8:0,…, 29:0, 30:1, 31:1, 32:2, 33:1,…, 40:1)
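The dictionary and the count vectors can be reproduced with a short sketch. It takes the stemmed token lists from the "Unique words identified" slide as input; the function names (`build_dictionary`, `to_vector`, `format_vector`) are illustrative, not from the slides.

```python
from collections import Counter

# Stemmed, stop-word-free tokens, copied from the "Unique words identified" slide.
DOCS = {
    "Doc1": ("search technology around forty year time user base expand first "
             "science technology information professional final information "
             "professional pretty much everyone").split(),
    "Doc2": ("math physics student familiar challenge find unambiguous right "
             "answer information retrieval find right document much art "
             "science").split(),
    "Doc3": ("many serial killer psychosis appear normal search killer take "
             "year latest police technology result shock").split(),
}


def build_dictionary(docs):
    """Assign 1-based references to words in order of first appearance."""
    dictionary = {}
    for tokens in docs.values():
        for token in tokens:
            dictionary.setdefault(token, len(dictionary) + 1)
    return dictionary


def to_vector(tokens, dictionary):
    """Dense count vector with one slot per dictionary entry."""
    vector = [0] * len(dictionary)
    for token, count in Counter(tokens).items():
        vector[dictionary[token] - 1] = count
    return vector


def format_vector(vector):
    """Render as <dictionary ref>:<no of occurrences> pairs."""
    return ", ".join(f"{ref}:{count}" for ref, count in enumerate(vector, start=1))


dictionary = build_dictionary(DOCS)
for name, tokens in DOCS.items():
    print(name, format_vector(to_vector(tokens, dictionary)))
```

With these inputs the dictionary has 40 entries, so each document becomes a 40-dimensional vector that is mostly zeros.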


Handling the Query

• “the promise of search technologies”
• After stemming: the promise of search technology
• search and technology are present in the dictionary, but “promise” is not, so it will be ignored
• Hence the search becomes search technology, which is equivalent to (1:1, 2:1), creating a new vector
• Converting it to a 40-dimensional array (1:1, 2:1, 3:0, 4:0, …, 40:0)
• Finally find the shortest distance (best match) between the query vector and the previously stored vectors.
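A minimal sketch of the matching step follows. The slide says only "shortest distance", so using cosine similarity here is an assumption (it is a common choice for the vector space model). Vectors are kept sparse as `Counter` objects rather than 40-element arrays, which gives the same result.

```python
import math
from collections import Counter

# Stemmed, stop-word-free tokens, copied from the "Unique words identified" slide.
DOCS = {
    "Doc1": ("search technology around forty year time user base expand first "
             "science technology information professional final information "
             "professional pretty much everyone").split(),
    "Doc2": ("math physics student familiar challenge find unambiguous right "
             "answer information retrieval find right document much art "
             "science").split(),
    "Doc3": ("many serial killer psychosis appear normal search killer take "
             "year latest police technology result shock").split(),
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_tokens, docs):
    """Drop query terms missing from the dictionary, then sort docs by similarity."""
    vocabulary = {token for tokens in docs.values() for token in tokens}
    query = Counter(token for token in query_tokens if token in vocabulary)
    scores = {name: cosine(query, Counter(tokens)) for name, tokens in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# "promise" is not in the dictionary, so only search and technology are used.
for name, score in rank(["promise", "search", "technology"], DOCS):
    print(f"{name}: {score:.3f}")
```

The choice of measure matters: with cosine similarity Doc1 ranks first, whereas plain Euclidean distance on the raw counts would put Doc3 closest, because Doc1's many other terms push it further from the short query.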


Enhancements
• Weighting multiple occurrences
– (1:1000, 2:1000)
• Weighting for phrases
– Search technology
– Police technology
– Information professional
– Information retrieval
• Word clustering
– Search/retrieval/find
– Technology/science/math/physics
– First/final/latest

• Custom biases
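Word clustering, for instance, could be sketched as a lookup that maps every member of a cluster to one representative term before vectors are built. The table below uses the three groups from the slide. Using the first word of each group as the representative is an assumption made for illustration.

```python
# Clusters from the slide. The first word of each group is used as its
# representative, which is an illustrative choice, not something the slide states.
CLUSTERS = [
    ["search", "retrieval", "find"],
    ["technology", "science", "math", "physics"],
    ["first", "final", "latest"],
]

# Map every cluster member to its representative term.
CANONICAL = {member: group[0] for group in CLUSTERS for member in group}


def cluster_tokens(tokens):
    """Replace each token by its cluster representative, if it has one."""
    return [CANONICAL.get(token, token) for token in tokens]


print(cluster_tokens(["information", "retrieval", "science"]))
```

After this step a document that only mentions "retrieval" and "science" shares terms with the query "search technology", so it is no longer scored as a complete miss.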


Google Page ranking

• PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
– A: page in question
– T1…Tn: documents that reference A
– PR: page rank
– C(Ti): total number of links to outside resources on page Ti
– d: heuristic damping factor, usually set to 0.85
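The formula can be evaluated by repeated substitution until the values stop changing. The sketch below assumes a small hypothetical link graph and a starting value of 1.0 for every page. Pages with no outgoing links simply contribute nothing, which is a simplification.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T that link to A.

    links maps each page to the set of pages it links to.
    """
    pages = set(links) | {target for targets in links.values() for target in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(max_iter):
        new_pr = {}
        for page in pages:
            incoming = sum(
                pr[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        delta = max(abs(new_pr[page] - pr[page]) for page in pages)
        pr = new_pr
        if delta < tol:
            break
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": {"B", "C"}, "B": {"C"}, "C": {"A"}})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this example C ends up with the highest rank, since it receives links from both other pages, and A comes second because C passes its whole rank to A.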


Web Spiders

• Selection policy
• Re-visit policy