THE COMPLEX TASK OF MAKING SEARCH SIMPLE Jaime Teevan (@jteevan) Microsoft Research UMAP 2015


THE WORLD WIDE WEB 20 YEARS AGO

Content
  2,700 websites (14% .com)

Tools
  Mosaic only 1 year old
  Pre-Netscape, IE, Chrome
  4 years pre-Google

Search Engines
  54,000 pages indexed by Lycos
  1,500 queries per day


THE WORLD WIDE WEB TODAY

Trillions of pages indexed.

Billions of queries per day.


1996

We assume information is static.

But web content changes!


SEARCH RESULTS CHANGE

New, relevant content

Improved ranking

Personalization

General instability

Can change during a query!


SEARCH RESULTS CHANGE


BIGGEST CHANGE ON THE WEB

Behavioral data.


BEHAVIORAL DATA MANY YEARS AGO

"It is impossible to separate a cube into two cubes, or a fourth power into two fourth powers, or in general, any power higher than the second, into two like powers. I have discovered a truly marvellous proof of this, which this margin is too narrow to contain."

Marginalia adds value to books

Students prefer annotated texts

Do we lose marginalia when we move to digital documents?

No! Scale makes it possible to look at experiences in the aggregate, and to tailor and personalize


PAST SURPRISES ABOUT WEB SEARCH

Early log analysis
  Excite logs from 1997, 1999
  Silverstein et al. 1999; Jansen et al. 2000; Broder 2002

Queries are not 7 or 8 words long

Advanced operators not used or “misused”

Nobody used relevance feedback

Lots of people search for sex

Navigational behavior common

Prior experience was with library search


SEARCH IS A COMPLEX, MULTI-STEP PROCESS

Typical query involves more than one click
  59% of people return to search page after their first click
  Clicked results often not the endpoint

People orienteer from results using context as a guide
  Not all information needs can be expressed with current tools
  Recognition is easier than recall

Typical search session involves more than one query
  40% of sessions contain multiple queries
  Half of all search time spent in sessions of 30+ minutes

Search tasks often involve more than one session
  25% of queries are from multi-session tasks


IDENTIFYING VARIATION ACROSS INDIVIDUALS

[Figure: Normalized DCG (y-axis, 0.75 to 1) versus Number of People in Group (x-axis, 1 to 6), with two series: Group and Individual]
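The y-axis metric in the figure, normalized DCG, scores how well a ranking matches one person's relevance judgments. The following is a minimal sketch, assuming the common log2(rank + 1) discount (the slides do not say which DCG variant was used) and hypothetical gain values:

```python
import math

def dcg(gains):
    """Discounted cumulative gain of relevance gains listed in ranked order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG divided by the DCG of the best possible ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical gains two people assign to the same three results, in ranked order.
person_a = [2, 1, 0]  # this ranking is ideal for person A
person_b = [0, 1, 2]  # the same ranking is far from ideal for person B

score_a = ndcg(person_a)
score_b = ndcg(person_b)
group_average = (score_a + score_b) / 2
```

In this toy case one shared ranking is ideal for person A but not for person B, so the average over the group falls below what each person could get from a ranking tailored to them.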


WHICH QUERY HAS LESS VARIATION?

campbells soup recipes v. vegetable soup recipe
tiffany’s v. tiffany
nytimes v. connecticut newspapers
www.usajobs.gov v. federal government jobs
singaporepools.com v. singapore pools


NAVIGATIONAL QUERIES WITH LOW VARIATION

Use everyone’s clicks to identify queries with low click entropy
  12% of the query volume
  Only works for popular queries

Clicks predicted only 72% of the time
  Double the accuracy for the average query
  But what is going on the other 28% of the time?

Many typical navigational queries are not identified
  People visit interior pages
    craigslist – 3% visit http://geo.craigslist.org/iso/us/ca
  People visit related pages
    weather.com – 17% visit http://weather.yahoo.com
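Click entropy can be illustrated with a short sketch. It treats the URLs clicked for a query, pooled across all users, as a probability distribution and computes its Shannon entropy in bits. The toy log, the `.example` URLs, and the function names are hypothetical, not taken from the talk:

```python
import math
from collections import Counter, defaultdict

def click_entropy(clicked_urls):
    """Shannon entropy (bits) of the distribution of clicked URLs."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def entropy_by_query(log):
    """Group (query, clicked_url) pairs by query and score each query."""
    clicks = defaultdict(list)
    for query, url in log:
        clicks[query].append(url)
    return {query: click_entropy(urls) for query, urls in clicks.items()}

# Hypothetical toy log of (query, clicked_url) pairs pooled across users.
toy_log = [
    ("nytimes", "nytimes.example"),
    ("nytimes", "nytimes.example"),
    ("nytimes", "nytimes.example"),
    ("nytimes", "nytimes.example"),
    ("connecticut newspapers", "paper-a.example"),
    ("connecticut newspapers", "paper-b.example"),
    ("connecticut newspapers", "paper-c.example"),
    ("connecticut newspapers", "paper-a.example"),
]

scores = entropy_by_query(toy_log)
```

A query whose clicks all land on one URL scores zero, which is the low-variation signature of a navigational query. A query whose clicks spread over several URLs scores higher.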


INDIVIDUALS FOLLOW PATTERNS

Getting ready in the morning.

Getting to a webpage.


FINDING OFTEN INVOLVES REFINDING

Repeat query (33%)
  user modeling, adaptation, and personalization

Repeat click (39%)
  http://umap2015.com/
  Query umap

Lots of repeats (43%)

                Repeat Click   New Click   Total
Repeat Query        29%            4%       33%
New Query           10%           57%       67%
Total               39%           61%
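The breakdown above can be reproduced from a click log with a small sketch. It assumes a simplified definition: a query is a repeat if the same user issued it before, and a click is a repeat if the same user clicked that URL before. The event list and function name are hypothetical:

```python
from collections import Counter

def refinding_breakdown(events):
    """Count (query status, click status) cells over (user, query, url) events."""
    seen_queries, seen_clicks = set(), set()
    cells = Counter()
    for user, query, url in events:
        q = "repeat query" if (user, query) in seen_queries else "new query"
        c = "repeat click" if (user, url) in seen_clicks else "new click"
        cells[(q, c)] += 1
        seen_queries.add((user, query))
        seen_clicks.add((user, url))
    return cells

# Hypothetical events in time order.
events = [
    ("u1", "umap", "http://umap2015.com/"),
    ("u1", "umap", "http://umap2015.com/"),
    ("u1", "user modeling conference", "http://umap2015.com/"),
    ("u1", "umap", "http://other.example/"),
    ("u2", "umap", "http://umap2015.com/"),
]
cells = refinding_breakdown(events)

# Percentages from the slide: any repeat = every cell except (new query, new click).
slide_any_repeat = 29 + 4 + 10
```

The slide's own numbers are consistent: 29% + 4% + 10% gives the 43% of queries that involve some kind of repeat.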


IDENTIFYING PERSONAL NAVIGATION

Use an individual’s clicks to identify repeat (query, click) pairs
  15% of the query volume
  Most occur fewer than 25 times in the logs

Queries more ambiguous
  Rarely contain a URL fragment
  Click entropy the same as for general Web queries
  Multiple meanings – enquirer
  Found navigation – bed bugs
  Serendipitous encounters – etsy

National Enquirer

Cincinnati Enquirer

http://www.medicinenet.com/bed_bugs/article.htm

[Informational]

Etsy.com

Regretsy.com (parody)

95%
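One way to find repeat (query, click) pairs for an individual is sketched below. It assumes a pair counts as personal navigation when the same user has issued the query at least `min_repeats` times and clicked the same single URL every time. That threshold, the function name, and the `.example` URLs are illustrative assumptions, not the rule used in the talk:

```python
from collections import defaultdict

def personal_navigation_pairs(history, min_repeats=2):
    """Map (user, query) to a URL when all of that user's clicks for the query agree."""
    clicks = defaultdict(list)
    for user, query, url in history:
        clicks[(user, query)].append(url)
    return {
        key: urls[0]
        for key, urls in clicks.items()
        if len(urls) >= min_repeats and len(set(urls)) == 1
    }

# Hypothetical per-user click history.
history = [
    ("u1", "enquirer", "national-enquirer.example"),
    ("u1", "enquirer", "national-enquirer.example"),
    ("u2", "enquirer", "cincinnati-enquirer.example"),
    ("u2", "enquirer", "cincinnati-enquirer.example"),
    ("u3", "etsy", "etsy.example"),
    ("u3", "etsy", "regretsy.example"),
]
pairs = personal_navigation_pairs(history)
```

In this toy data the same query, enquirer, is navigational for two users but leads to different destinations, which pooled click entropy alone would not reveal.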


SUPPORTING PERSONAL NAVIGATION

Tom Bosley - Wikipedia, the free encyclopedia
Thomas Edward "Tom" Bosley (October 1, 1927 – October 19, 2010) was an American actor, best known for portraying Howard Cunningham on the long-running ABC sitcom Happy Days. Bosley was born in Chicago, the son of Dora and Benjamin Bosley.
en.wikipedia.org/wiki/tom_bosley

Tom Bosley - Wikipedia, the free encyclopedia
Bosley died at 4:00 a.m. of heart failure on October 19, 2010, at a hospital near his home in Palm Springs, California. … His agent, Sheryl Abrams, said Bosley had been battling lung cancer.
en.wikipedia.org/wiki/tom_bosley


PATTERNS A DOUBLE EDGED SWORD

Patterns are predictable.

Changing a pattern is confusing.


CHANGE INTERRUPTS PATTERNS

Example: Dynamic menus
  Put commonly used items at top
  Slows menu item access

Does search result change interfere with refinding?


CHANGE INTERRUPTS REFINDING

When search result ordering changes people are
  Less likely to click on a repeat result
  Slower to click on a repeat result when they do
  More likely to abandon their search

Happens within a query and across sessions

Even happens when the repeat result moves up!

How to reconcile the benefits of change with the interruption?

[Figure: Time to click S2 (secs) versus Time to click S1 (secs), with series for repeat results that moved Down, were Gone, Stayed, or moved Up]


USE MAGIC TO MINIMIZE INTERRUPTION


ABRACADABRA Magic happens.


YOUR CARD IS GONE!


CONSISTENCY ONLY MATTERS SOMETIMES


BIAS PERSONALIZATION BY EXPERIENCE


CREATE CHANGE BLIND WEB EXPERIENCES


CREATE CHANGE BLIND WEB EXPERIENCES


THE COMPLEX TASK OF MAKING SEARCH SIMPLE

Challenge: The web is complex
  Tools change, content changes
  Different people use the web differently

Fortunately, individuals are simple
  We are predictable, follow patterns
  Predictability enables personalization

Beware of breaking expectations!
  Bias personalization by expectations
  Create magic personal experiences


REFERENCES

Broder. A taxonomy of web search. SIGIR Forum, 2002.

Donato, Bonchi, Chi & Maarek. Do you want to take notes? Identifying research missions in Yahoo! Search Pad. WWW 2010.

Dumais. Task-based search: A search engine perspective. NSF Task-Based Information Search Systems Workshop, 2013.

Jansen, Spink & Saracevic. Real life, real users, and real needs: A study and analysis of user queries on the web. IP&M, 2000.

Kim, Cramer, Teevan & Lagun. Understanding how people interact with web search results that change in real-time using implicit feedback. CIKM 2013.

Lee, Teevan & de la Chica. Characterizing multi-click search behavior and the risks and opportunities of changing results during use. SIGIR 2014.

Mitchell & Shneiderman. Dynamic versus static menus: An exploratory comparison. SIGCHI Bulletin, 1989.

Selberg & Etzioni. On the instability of web search engines. RIAO 2000.

Silverstein, Marais, Henzinger & Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 1999.

Somberg. A comparison of rule-based and positionally constant arrangements of computer menu items. CHI 1986.

Svore, Teevan, Dumais & Kulkarni. Creating temporally dynamic web search snippets. SIGIR 2012.

Teevan. The Re:Search Engine: Simultaneous support for finding and re-finding. UIST 2007.

Teevan. How people recall, recognize and reuse search results. TOIS, 2008.

Teevan, Alvarado, Ackerman & Karger. The perfect search engine is not enough: A study of orienteering behavior in directed search. CHI 2004.

Teevan, Collins-Thompson, White & Dumais. Viewpoint: Slow search. CACM, 2014.

Teevan, Collins-Thompson, White, Dumais & Kim. Slow search: Information retrieval without time constraints. HCIR 2013.

Teevan, Cutrell, Fisher, Drucker, Ramos, Andrés & Hu. Visual snippets: Summarizing web pages for search and revisitation. CHI 2009.

Teevan, Dumais & Horvitz. Potential for personalization. TOCHI, 2010.

Teevan, Dumais & Liebling. To personalize or not to personalize: Modeling queries with variation in user intent. SIGIR 2008.

Teevan, Liebling & Geetha. Understanding and predicting personal navigation. WSDM 2011.

Tyler & Teevan. Large scale query log analysis of re-finding. WSDM 2010.

More at: http://research.microsoft.com/~teevan/publications/


THANK YOU!

Jaime Teevan (@jteevan)


EXTRA SLIDES How search engines can make use of change to improve search.


CHANGE CAN IDENTIFY IMPORTANT TERMS

Divergence from norm
  cookbooks, frightfully, merrymaking, ingredient, latkes

Staying power in page

[Figure: time axis running Sep., Oct., Nov., Dec.]


CHANGE CAN IDENTIFY IMPORTANT SEGMENTS

Page elements change at different rates

Pages are revisited at different rates

Resonance can serve as a filter for important content


EXTRA SLIDES Impact of change on refinding behavior.


BUT CHANGE HELPS WITH FINDING!

Change to click
  Unsatisfied initially: Gone > Down > Stay > Up
  Satisfied initially: Stay > Down > Up > Gone

Changes around click
  Always benefit NSAT users
  Best below the click for satisfied users

Change to click
          NSAT    SAT
Up        2.00    4.65
Stay      2.08    4.78
Down      2.20    4.75
Gone      2.31    4.61

Changes around click, NSAT
          Changes   Static
Above     2.30      2.21
Below     2.09      1.99

Changes around click, SAT
          Changes   Static
Above     4.93      4.93
Below     4.79      4.61


EXTRA SLIDES Privacy issues and behavioral logs.


PUBLIC SOURCES OF BEHAVIORAL LOGS

Public Web service content
  Twitter, Facebook, Digg, Wikipedia

Research efforts to create logs
  Lemur Community Query Log Project
    http://lemurstudy.cs.umass.edu/
    1 year of data collection = 6 seconds of Google logs

Publicly released private logs
  DonorsChoose.org
    http://developer.donorschoose.org/the-data
  Enron corpus, AOL search logs, Netflix ratings


EXAMPLE: AOL SEARCH DATASET

August 4, 2006: Logs released to academic community
  3 months, 650 thousand users, 20 million queries
  Logs contain anonymized User IDs

August 7, 2006: AOL pulled the files, but already mirrored

August 9, 2006: New York Times identified Thelma Arnold
  “A Face Is Exposed for AOL Searcher No. 4417749”
  Queries for businesses, services in Lilburn, GA (pop. 11k)
  Queries for Jarrett Arnold (and others of the Arnold clan)
  NYT contacted all 14 people in Lilburn with Arnold surname
  When contacted, Thelma Arnold acknowledged her queries

August 21, 2006: 2 AOL employees fired, CTO resigned

September, 2006: Class action lawsuit filed against AOL

AnonID    Query                          QueryTime             ItemRank   ClickURL
-------   ----------------------------   -------------------   --------   --------
1234567   jitp                           2006-04-04 18:18:18   1          http://www.jitp.net/
1234567   jipt submission process        2006-04-04 18:18:18   3          http://www.jitp.net/m_mscript.php?p=2
1234567   computational social scinece   2006-04-24 09:19:32
1234567   computational social science   2006-04-24 09:20:04   2          http://socialcomplexity.gmu.edu/phd.php
1234567   seattle restaurants            2006-04-24 09:25:50   2          http://seattletimes.nwsource.com/rests
1234567   perlman montreal               2006-04-24 10:15:14   4          http://oldwww.acm.org/perlman/guide.html
1234567   jitp 2006 notification         2006-05-20 13:13:13
…
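A log in this shape is easy to regroup by user, which is why a stable ID links every query a person made into one profile. The sketch below assumes the fields are tab-separated with the header row shown above. The sample rows are copied from the slide's illustrative excerpt and the function name is hypothetical:

```python
import csv
import io
from collections import defaultdict

# Three rows from the slide's excerpt, written as tab-separated text (assumed layout).
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1234567\tjitp\t2006-04-04 18:18:18\t1\thttp://www.jitp.net/\n"
    "1234567\tcomputational social scinece\t2006-04-24 09:19:32\t\t\n"
    "1234567\tseattle restaurants\t2006-04-24 09:25:50\t2\t"
    "http://seattletimes.nwsource.com/rests\n"
)

def queries_by_user(tsv_text):
    """Collect each AnonID's queries in log order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["AnonID"]].append(row["Query"])
    return dict(grouped)

profile = queries_by_user(SAMPLE)
```

Even without a name attached, the combined query list for one ID hints at a profession, a research interest, and a city.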


EXAMPLE: AOL SEARCH DATASET

Other well known AOL users
  User 927 – how to kill your wife
  User 711391 – i love alaska
    http://www.minimovies.org/documentaires/view/ilovealaska

Anonymous IDs do not make logs anonymous
  Contain directly identifiable information
    Names, phone numbers, credit cards, social security numbers
  Contain indirectly identifiable information
    Example: Thelma’s queries
    Birthdate, gender, zip code identifies 87% of Americans
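The risk from indirect identifiers can be checked on any table by counting how many records are unique on a combination of fields. This is a minimal sketch with made-up records and a hypothetical function name. It does not reproduce the 87% figure, which comes from population data:

```python
from collections import Counter

def fraction_unique(records, keys):
    """Share of records whose values on `keys` appear exactly once in the table."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Made-up records; none describe real people.
people = [
    {"birthdate": "1950-01-01", "gender": "F", "zip": "12345"},
    {"birthdate": "1950-01-01", "gender": "F", "zip": "12345"},
    {"birthdate": "1950-01-01", "gender": "M", "zip": "12345"},
    {"birthdate": "1972-06-15", "gender": "F", "zip": "12345"},
]

by_zip = fraction_unique(people, ["zip"])
by_gender = fraction_unique(people, ["gender"])
by_all = fraction_unique(people, ["birthdate", "gender", "zip"])
```

Each field alone singles out few or no records here, but the combination singles out half of them. The same effect at population scale is what makes birthdate, gender, and zip code identifying.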


EXAMPLE: NETFLIX CHALLENGE

October 2, 2006: Netflix announces contest
  Predict people’s ratings for a $1 million prize
  100 million ratings, 480k users, 17k movies
  Very careful with anonymity post-AOL

May 18, 2008: Data de-anonymized
  Paper published by Narayanan & Shmatikov
  Uses background knowledge from IMDB
  Robust to perturbations in data

December 17, 2009: Doe v. Netflix

March 12, 2010: Netflix cancels second competition

Ratings
1: [Movie 1 of 17770]
12, 3, 2006-04-18 [CustomerID, Rating, Date]
1234, 5, 2003-07-08 [CustomerID, Rating, Date]
2468, 1, 2005-11-12 [CustomerID, Rating, Date]
…

Movie Titles
…
10120, 1982, “Bladerunner”
17690, 2007, “The Queen”
…

All customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy. . . Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset) and that data was subject to perturbation.