Web Behavior Analysis
Your Last Words? (in 22nd century)
• To family
• To your best friend?
Web Behavior Analysis
• Why important?
• Why scary?
Part I: Why Important?
Q. In the past six months have you used a search engine to help inform your decisions for the following tasks?
66% of people are using search more frequently to make decisions
• We rely more and more on search for our real-life decisions
– Opportunities for business
– Concerns for privacy
Length of Sessions by Type
What should be done?
• Focus on new territory
Taxonomy of Web queries
• Navigational (we are good at this)
– To reach a particular site
  • E.g., searching for the top page of a company
• Informational
– To acquire pages that provide knowledge for the user’s information need
  • Conventional ad hoc retrieval
• Transactional
– To perform a Web-mediated activity
  • E.g., online shopping
Navigational Queries Pseudo- Navigational Queries
Example: Good and Bad
• Car GPS around $300
• Four-day trip to Bhutan from Delhi to visit important Buddhist places
Example of “Hard Queries”: Informational/Transactional
Game Consoles
Party Site
What we want?
Current research directions
• How to classify queries?
• Then what?
– Search engines trying to reduce clicks for “hard queries”
– Extracting info from forums
Importance of query classification: “obama”
• Informational: people may search to learn more about Barack Obama
• Navigational: the user wants to visit his official website
• Transactional: perhaps the user's goal is to donate money online to support Mr. Obama’s campaign
Yahoo numbers
• ~25 informational: content text?
• ~40 navigational: anchor text?
• ~35 transactional: site template?
Can you tell if query is “navigational” or not?
Lee et al.[WWW05]: Overview
• Analyzing how query term is used in anchor texts
• Q = “WWW2008”: anchor texts “WWW2008” all link to the top page of WWW2008. Destinations are identical → Navigational
• Q = “search”: anchor texts “search” link to different pages (e.g., a description in Wikipedia, a search engine). Destinations are diverse → Informational
Anchor-link distribution (ALD)
• P(d | t): probability that the page linked by anchor text t is d
• t = WWW2008: ALD is skewed toward the top page of WWW2008 → Navigational
• t = search: ALD is uniform over destinations such as Google, Yahoo!, Wikipedia → Informational
Lee et al.: Problem
• Targeting only anchor texts that are exactly the same as the query
– If the same anchor text as the query does not exist on the Web, ALD cannot be computed
• Problematic queries
– Long phrase
  • E.g., “information retrieval system research”
– Multiple keywords
  • E.g., “trec, nist, test collection”
Multi-query solution
• Query Q = “trec, test collection”
• Terms T = {trec, test, collection}
• Destinations D = {d1, d2, …}
• Compute the ALD P(d | t) on a term-by-term basis (t = trec, t = test, t = collection) and integrate them
Computation of classification score
• Entropy of D given T: the weighted average, with weights P(t), of the entropy of each single term t

\[ H(D \mid T) = -\sum_{t \in T} P(t) \sum_{d \in D} P(d \mid t) \log P(d \mid t) \]
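The anchor-link distribution and the entropy score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (anchor term, destination) pairs and their URLs are made up, P(t) is taken to be uniform over the query terms that occur as anchor text (the slides do not define it), and the function names are hypothetical rather than taken from Lee et al.

```python
import math
from collections import Counter, defaultdict

def anchor_link_distributions(anchor_links):
    """Estimate P(d | t) from (anchor term, destination) pairs."""
    counts = defaultdict(Counter)
    for term, dest in anchor_links:
        counts[term][dest] += 1
    return {
        term: {dest: n / sum(dests.values()) for dest, n in dests.items()}
        for term, dests in counts.items()
    }

def term_entropy(dist):
    """Entropy in bits of one term's ALD: -sum_d P(d|t) * log2 P(d|t)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def classification_score(query_terms, ald, term_weights=None):
    """H(D | T): weighted average of the per-term entropies.

    Assumption: P(t) is uniform over the query terms that have an ALD,
    unless explicit term_weights are supplied.
    """
    terms = [t for t in query_terms if t in ald]
    if not terms:
        raise ValueError("no query term appears as anchor text")
    if term_weights is None:
        term_weights = {t: 1 / len(terms) for t in terms}
    return sum(term_weights[t] * term_entropy(ald[t]) for t in terms)

# Made-up anchor data: "www2008" always points to one page,
# "search" points to three different pages equally often.
links = [
    ("www2008", "www2008.example"), ("www2008", "www2008.example"),
    ("www2008", "www2008.example"),
    ("search", "google.example"), ("search", "yahoo.example"),
    ("search", "wikipedia.example"),
]
ald = anchor_link_distributions(links)
nav_score = classification_score(["www2008"], ald)   # skewed ALD, low entropy
info_score = classification_score(["search"], ald)   # uniform ALD, high entropy
```

A low score suggests a navigational query and a high score an informational one; where to put the threshold is left open here, as it is on the slides.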
Now what?
• For “WWII”
– Google: http://www.google.com/search?q=WWII&hl=en&tbo=1&output=search&tbs=ww:1
– Microsoft: http://www.bing.com/reference/semhtml/World_War_II?fwd=1&qpvt=wwii&src=abop&q=wwii
– Wolfram: http://www.wolframalpha.com/input/?i=wwII
• Can you tell informational vs. transactional?
Challenges/Opportunities
• Slightly subtle/interleaved
• But huge advertisement revenue (yet to be explored)!
• Classic query log + clicks on the surface web are not enough
• Any ideas?
More signals?
• Eye movement?
• Brain signal?
More corpus? (social corpus for polls? expert advice?)
More signal
CS: Client Simple
• First representation:
– Trajectory length
– Horizontal range
– Vertical range
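The three Client Simple features can be computed directly from a recorded cursor trajectory. The sketch below assumes the trajectory is a time-ordered list of (x, y) positions in pixels; the function name and the sample points are illustrative, not from the original study.

```python
import math

def client_simple_features(points):
    """Compute the CS features from time-ordered (x, y) cursor positions."""
    if not points:
        raise ValueError("empty trajectory")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Trajectory length: sum of distances between consecutive positions.
    length = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    return {
        "trajectory_length": length,
        "horizontal_range": max(xs) - min(xs),
        "vertical_range": max(ys) - min(ys),
    }

# Made-up trajectory: move diagonally, then straight down.
features = client_simple_features([(0, 0), (3, 4), (3, 10)])
```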
CF: Client Full
• Second representation:
– 5 segments: initial, early, middle, late, and end
– Each segment: speed, acceleration, rotation, slope, etc.
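A rough sketch of the Client Full idea: split the trajectory into five contiguous segments and compute a few per-segment features. Everything specific here is an assumption: samples are (t, x, y) tuples with strictly increasing timestamps, segments hold roughly equal numbers of steps, acceleration is approximated as the change in step speed across the segment, slope is the direction of net displacement, and rotation is left out for brevity.

```python
import math

SEGMENT_NAMES = ["initial", "early", "middle", "late", "end"]

def segment_features(segment):
    """Features of one segment of time-ordered (t, x, y) samples."""
    steps = list(zip(segment, segment[1:]))
    step_speeds = [math.dist(a[1:], b[1:]) / (b[0] - a[0]) for a, b in steps]
    duration = segment[-1][0] - segment[0][0]
    path = sum(math.dist(a[1:], b[1:]) for a, b in steps)
    dx = segment[-1][1] - segment[0][1]
    dy = segment[-1][2] - segment[0][2]
    return {
        "speed": path / duration,
        # Crude estimate: change in step speed over the segment duration.
        "acceleration": (step_speeds[-1] - step_speeds[0]) / duration,
        # Direction of net displacement, in radians.
        "slope": math.atan2(dy, dx),
    }

def client_full_features(samples):
    """Split the trajectory into five segments and describe each one."""
    n = len(SEGMENT_NAMES)
    n_steps = len(samples) - 1
    if n_steps < n:
        raise ValueError("need at least one step per segment")
    bounds = [i * n_steps // n for i in range(n + 1)]
    return {
        name: segment_features(samples[bounds[i]:bounds[i + 1] + 1])
        for i, name in enumerate(SEGMENT_NAMES)
    }

# Made-up trajectory: constant speed of 10 px per time unit along the x axis.
samples = [(t, 10 * t, 0) for t in range(11)]
cf = client_full_features(samples)
```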
Navigational query: “facebook”
Informational query: “spanish wine”
Transactional query: “integrator”
More corpus
• cQA successful, as “additional corpus”, not as “additional means”
• Challenges?
cQA (Yahoo Answers)
How Yahoo Answers works
Good questions draw good answers
Good Q/A? -- Text
Check also: http://www.addedbytes.com/code/readability-score/
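One commonly used readability measure is the Flesch Reading Ease score, which could serve as a text feature for Q/A quality. The sketch below is an assumption about what such a feature might look like, not necessarily the measure used in the work the slide refers to; the syllable counter is a very rough heuristic and the sample sentences are made up.

```python
import re

def count_syllables(word):
    """Very rough heuristic: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

easy = flesch_reading_ease("The cat sat on the mat. It was warm.")
hard = flesch_reading_ease(
    "Comprehensive probabilistic interpretation necessitates "
    "methodological sophistication."
)
```

Higher scores mean easier text, so `easy` comes out well above `hard`.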
Good Q/A? -- Clicks
Good Q/As? -- Community
Why scary?
Useful beyond imagination
• Spell checker: SIGMOD → Did you mean “sigmoid”?
• Entity relation: SIGMOD ~ SIGIR
• Translation: SIGMOD, 씨그모드 (Korean transliteration of “SIGMOD”) → sigmod.com
• Query suggestion: 영일대 호텔 (“Yeongildae Hotel”) for 영일대
• Rank learning: the top 10 entry is visited all the time, what should we do?
• Cause of migraine?
Companies need YOUR HELP
• AOL released logs
• Guess what happened?
More scientific observations (Yahoo Research)
• X = {query1, query2, query3}
• Y = age, gender, area
• X → Y (how likely?)
• Validate with ground-truth info (Yahoo account)
See if you can do it?
• You observe yourself:
http://aolpsycho.com/user/5826-kallemeyn
Gender
• Female: fanfiction, bridal, makeup, women’s, knitting, hair, ecards, glitter, yoga, and diet
• Male: nfl, poker, espn, ufc, railroad, prostate, football, golf, male, wrestling, compusa, as well as a variety of adult terms
Accuracy: 80+%
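To make the privacy risk concrete, here is a toy keyword-vote sketch built only from the indicative terms listed above. It is not the method behind the 80+% figure, which the slides do not describe; the function name and the sample queries are hypothetical.

```python
# Indicative terms copied from the slide; everything else is illustrative.
FEMALE_TERMS = {"fanfiction", "bridal", "makeup", "women's", "knitting",
                "hair", "ecards", "glitter", "yoga", "diet"}
MALE_TERMS = {"nfl", "poker", "espn", "ufc", "railroad", "prostate",
              "football", "golf", "male", "wrestling", "compusa"}

def guess_gender(queries):
    """Count indicative terms across a user's queries and take the majority."""
    tokens = [tok for q in queries for tok in q.lower().split()]
    female = sum(tok in FEMALE_TERMS for tok in tokens)
    male = sum(tok in MALE_TERMS for tok in tokens)
    if female == male:
        return "unknown"
    return "female" if female > male else "male"

guess_a = guess_gender(["nfl scores", "poker rules"])
guess_b = guess_gender(["yoga for beginners"])
guess_c = guess_gender(["weather"])
```

Even a handful of queries can tip such a vote, which is the point of the slide: query logs leak demographic attributes.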
Age
• YOUNG: myspace, pregnancy, wikipedia, lyrics, quotes, apartments, torrent, baby, wedding, mall, soundtrack;
• OLD: aarp, telephone, lottery, amazon.com, retirement, funeral, senior, mapquest, medicare, newspapers, repair.
Place
• A user’s zip code (US postal code) or other identifier of location may be detectable from place names used in
• Check out YahooGEO Apis
Name?
• 50+% issued their name
• (but other names too)
Ref: "Vanity Fair: Privacy in Querylog Bundles"
User Solutions?
• TrackMeNot (TMN): an extension to the Firefox web browser that initiates randomized search queries in the background to a number of commercial search engines
• Tor: change IP/cookie (prevents aggregation)
– Drawback: losing services, e.g., personalization
Company Solution
• K-anonymity (bundling)
– Reported to be unsafe for vanity search + geo-query, long-tail keywords
– So far, releasing logs is considered to be TOO RISKY
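One simple reading of k-anonymity for query logs is to release a query only if at least k distinct users issued it. The sketch below illustrates that reading under stated assumptions: the log format, the function name, and the sample entries are made up, and the exact bundling scheme meant on the slide may differ.

```python
from collections import defaultdict

def k_anonymous_queries(log, k):
    """Keep only queries issued by at least k distinct users.

    log: iterable of (user_id, query) pairs. Queries are compared
    after lowercasing and trimming whitespace.
    """
    users_by_query = defaultdict(set)
    for user_id, query in log:
        users_by_query[query.strip().lower()].add(user_id)
    return {q for q, users in users_by_query.items() if len(users) >= k}

# Made-up log: one popular query, one vanity + geo query, one rare query.
log = [
    ("u1", "weather"), ("u2", "weather"), ("u3", "Weather"),
    ("u1", "john smith 90210"),
    ("u2", "sigmod"),
]
released = k_anonymous_queries(log, k=3)
```

Only the popular query survives. The rare, identifying queries are dropped, yet the slide notes that combinations such as vanity search plus geo-query can still leak identity.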
Summary
• You are leaving trails in the cyber world, which aligns more and more with real-life trails
• Companies are interested in predicting as much as possible of your next behaviors
• More signals? More corpus?
• Can you hide enough to protect privacy, while revealing enough to enable such prediction? (privacy dilemma)
• But it is ok even if we can’t know (product state-of-the-art)
Search UI? Visualization?
What are query aspects?
Challenge
• Intentions are hidden
– Omission of key information makes intent in queries ambiguous
– E.g., omitting “reviews” when searching for reviews of “Canon EOS 40D SLR”
– E.g., omitting “location/city” when searching for “jobs”
• Queries are often too broad
Goal
• Mine broad latent aspects from search logs
– Formulate the problem based on a real-world model of user interaction with the search engine (session = 10 mins)
– Bring interesting aspects to the attention of editors who can then determine saliency and usefulness
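User interaction model
• Without query aspects:
– User enters original query “Canon EOS 40D”
– Search engine (SE) returns general results
– User reformulates query by adding qualifier “reviews”
– SE returns reviews of the camera
– User’s query is satisfied, e.g., she clicks on a result
• With query aspects:
– User enters original query “Canon EOS 40D”
– SE returns general results + query aspects
– User reformulates query by selecting the “reviews” aspect
– SE returns reviews of the camera
– User’s query is satisfied, e.g., she clicks on a result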
Learning of query aspects from reformulations
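A minimal sketch of mining aspects from reformulations: within a session, if a user's next query extends the previous one by appending words, count the appended words as a candidate aspect of the original query. The assumptions are loud here: the 10-minute session is interpreted as a gap between consecutive queries, only suffix qualifiers are detected, and the log format, function name, and sample events are made up.

```python
from collections import Counter, defaultdict

SESSION_SECONDS = 10 * 60  # session = 10 mins, as stated in the goal

def mine_aspects(events):
    """events: (user_id, timestamp_seconds, query) tuples in any order.

    Returns {base query: Counter of qualifiers appended in reformulations}.
    """
    by_user = defaultdict(list)
    for user, ts, query in events:
        by_user[user].append((ts, query.lower().split()))

    aspects = defaultdict(Counter)
    for history in by_user.values():
        history.sort(key=lambda event: event[0])
        for (t1, q1), (t2, q2) in zip(history, history[1:]):
            same_session = t2 - t1 <= SESSION_SECONDS
            # q2 must start with all of q1 and add at least one word.
            extends = bool(q1) and len(q2) > len(q1) and q2[:len(q1)] == q1
            if same_session and extends:
                aspects[" ".join(q1)][" ".join(q2[len(q1):])] += 1
    return aspects

# Made-up log: two users add "reviews"; a third reformulates too late.
events = [
    ("u1", 0, "canon eos 40d"), ("u1", 60, "canon eos 40d reviews"),
    ("u2", 0, "canon eos 40d"), ("u2", 30, "canon eos 40d reviews"),
    ("u3", 0, "jobs"), ("u3", 5000, "jobs seattle"),
]
aspects = mine_aspects(events)
```

Frequent qualifiers such as “reviews” would then be the candidates shown to editors, who decide on saliency and usefulness.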
Results: Examples of aspects found
New directions might be
• Taking target web page clicked into account while constructing aspects
• Or visualization techniques helping to visually/perceptually/cognitively mine such “aspects”
– Visualization/refinement iterations to narrow down
Tomorrow 4:15pm (B2 102)
Title: Using Information Visualization to Understand Data
Abstract: Information Visualization is the art and science of representing abstract information in a visual form that enables users to understand data through their perceptual and cognitive capabilities.
Speaker: Dr. Bongshin Lee (Microsoft Research)