face of the stopwords The February 2014 Monthly Tomek Sobczak

Stopwords in Search

what are stop words?

common wisdom

• they are everywhere and bloat the index

• remove them to increase performance (smaller index and query) and relevance of search results

• … but sometimes stop words do add some meaning to a sentence

• … and sometimes you need them: "To be or not to be"

• the best of both worlds? multiple mappings of the data: one with stop words removed and one with stop words kept. But that doubles the data by indexing it in two different ways!

• Common Terms Query analyzes the query and identifies which words are "important" based on the document frequency of each term

• Common Terms Query leverages the power of stop word removal (faster searches) without eliminating stop words (they can contribute to the score sometimes)

• Common Terms Query adapts to your domain: words with high frequency are automatically recognized as stop words
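
The idea behind the Common Terms Query can be sketched in a few lines of Python. This is a toy illustration, not the search engine's actual implementation: the corpus, the `cutoff` value, and the function names `document_frequencies` and `split_query_terms` are all assumptions made for the example. It only shows how query terms can be split into "important" and "common" groups by document frequency, with no fixed stop word list.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        # set() so a term repeated inside one document counts once
        df.update(set(doc.lower().split()))
    return df


def split_query_terms(query, df, num_docs, cutoff=0.6):
    """Split query terms into (important, common) by relative document frequency.

    `cutoff` is a hypothetical threshold: a term found in more than this
    fraction of documents is treated as a stop word for this corpus.
    """
    important, common = [], []
    for term in query.lower().split():
        ratio = df[term] / num_docs
        if ratio > cutoff:
            common.append(term)
        else:
            important.append(term)
    return important, common


# Toy corpus (made up for illustration)
docs = [
    "the history of the pearl trade",
    "the pearl by john steinbeck",
    "a guide to the calculus of variations",
    "the who live in concert",
]

df = document_frequencies(docs)
important, common = split_query_terms("the pearl", df, len(docs))
print(important, common)
```

In this corpus "the" occurs in every document, so it lands in the common group, while "pearl" stays important. A real engine would typically search on the important terms first and use the common ones only to adjust the score.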

restoring stop words

possibility of improving

• searches consisting only of stopwords (improved recall)
    • to be or not to be
    • The Who

• short searches that include stopwords (improved precision)
    • pearl vs. the pearl
    • the one
    • a zukofsky (author Zukofsky, title "a")

• distinguish "in" from "and" in some cases
    • archaeology in literature != archaeology and literature
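
A minimal sketch of why these cases improve when stop words are kept. The `STOP_WORDS` set and the `analyze` function are hypothetical stand-ins for a real analyzer, and whitespace tokenization is a simplifying assumption.

```python
# Small illustrative stop word list (an assumption, not any engine's default)
STOP_WORDS = {"a", "and", "be", "in", "not", "or", "the", "to"}


def analyze(text, remove_stop_words):
    """Lowercase and whitespace-tokenize, optionally dropping stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


# With stop word removal the whole query disappears: nothing left to match
print(analyze("to be or not to be", remove_stop_words=True))

# With removal, these two different queries become indistinguishable
print(analyze("archaeology in literature", remove_stop_words=True))
print(analyze("archaeology and literature", remove_stop_words=True))

# Without removal they stay distinct
print(analyze("archaeology in literature", remove_stop_words=False))
print(analyze("archaeology and literature", remove_stop_words=False))
```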

possibility of degrading

• long queries (over 6 terms) with a lot of stopwords have reduced precision
    • Lectures on the Calculus of Variations and Optimal Control Theory
    • BUT: the words occurring as a phrase float to the top
    • AND: you can modify the minimum match (mm) param
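
The effect of a minimum match setting on a long, stopword-heavy query can be sketched as follows. This is a simplified model, not the exact semantics of any engine's `mm` parameter: the `matches` function, the percentage form of `mm_percent`, rounding down, and the sample documents are all assumptions for illustration.

```python
def matches(query, doc, mm_percent):
    """True if `doc` contains at least mm_percent of the distinct query terms.

    Simplified: whitespace tokens, integer percentage, rounded down,
    and at least one term is always required.
    """
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    required = max(1, len(q_terms) * mm_percent // 100)
    return len(q_terms & d_terms) >= required


query = "Lectures on the Calculus of Variations and Optimal Control Theory"

# The intended title, and an unrelated document that shares mostly stop words
doc_a = "lectures on the calculus of variations and optimal control theory"
doc_b = "the theory and practice of gardening on the coast"

print(matches(query, doc_b, 50))   # loose setting: the unrelated document matches
print(matches(query, doc_b, 80))   # stricter setting: it no longer matches
print(matches(query, doc_a, 100))  # the intended title matches even at 100%
```

With a loose setting the unrelated document matches because it shares "on", "the", "of", "and" and "theory" with the query, which is the precision loss described above. Raising the required share filters it out while the intended title still matches.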

how to decide?

• take a look at your business knowledge domain

• measure the percentage of searches that contain stop words

• count the terms in user queries
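
These measurements can be taken from a query log with a short script. The sketch below assumes the log is already available as a list of query strings. The `STOP_WORDS` set, the sample queries, and the function name `query_log_stats` are hypothetical; the "over 6 terms" threshold follows the long-query case above.

```python
# Small illustrative stop word list (an assumption, not any engine's default)
STOP_WORDS = {"a", "and", "be", "in", "not", "of", "on", "or", "the", "to"}


def query_log_stats(queries, long_query_terms=6):
    """Summarize stop word usage and query length over a list of queries."""
    if not queries:
        raise ValueError("query log is empty")

    total = len(queries)
    with_stop = only_stop = long_queries = term_count = 0

    for q in queries:
        terms = q.lower().split()
        stop_count = sum(t in STOP_WORDS for t in terms)
        term_count += len(terms)
        if stop_count > 0:
            with_stop += 1
        if terms and stop_count == len(terms):
            only_stop += 1
        if len(terms) > long_query_terms:
            long_queries += 1

    return {
        "pct_with_stop_words": 100.0 * with_stop / total,
        "pct_only_stop_words": 100.0 * only_stop / total,
        "pct_long_queries": 100.0 * long_queries / total,
        "avg_terms_per_query": term_count / total,
    }


# Made-up sample log for illustration
sample_log = [
    "to be or not to be",
    "the pearl",
    "pearl",
    "archaeology in literature",
]

stats = query_log_stats(sample_log)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

A high share of stopword-only or short stopword-bearing queries argues for restoring stop words, while a high share of long queries suggests watching precision.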

Thank you!