Automatic Identification of User Goals in Web Search
Uichin Lee, Zhenyu Liu, Junghoo Cho
Computer Science Department, UCLA
{uclee, vicliu, cho}@cs.ucla.edu
Motivation
• Users have different goals for Web search
– Reach the homepage of an organization (e.g., UCLA)
– Learn about a topic (e.g., simulated annealing)
– Download online music, etc.
• Can we identify the user goal for a Web search
automatically?
– e.g., improve and customize search results based on the identified user goal
Two high-level user-goals
• Navigational query
– Reach a Web site the user already has in mind (e.g.,
“UCLA Library”)
• Informational query
– Visit multiple sites to learn about a particular topic (e.g.,
“Simulated Annealing”)
• Based on [Broder02, Rose&Levinson04]
– Navigational and informational are common in both studies
Exploiting identified user goals
• Tailored weighting/ranking mechanisms
  – Navigational queries
    • Emphasize anchor text [Craswell01, Kang03] and URL paths [Westerveld01]
  – Informational queries
    • Emphasize page content [Kang03] and IR techniques (query expansion, relevance feedback, pseudo relevance feedback, etc.)
• Tailored result presentation
  – Informational queries
    • Clustered search results [Etzioni99, Zeng04, Kummamuru04]
• Targeted ads / answers
Outline
• Are query goals predictable?
– Human-subject study
• How can we predict user goals automatically?
– Anchor-link distribution
– User-click distribution
• How effective are our features?
– Experimental evaluation
Are query goals “predictable”?
• Search engines “see” only a few keywords
– No explicit indication of goals by users
– Can we predict the user goal simply from the keywords?
• Human subject study
– 50 most popular Google queries from UCLA CS
– 28 participants (grad students) from UCLA CS
– Ask subjects to indicate the likely goal of each query if
they had issued it
• Do most subjects agree on a particular goal?
Human subject study results
• i(q) – the fraction of participants who judge query q as informational
  – e.g., i(q) = 0.038 for “UCLA Library”
[Figure: histogram of i(q) over the 50 queries; x-axis: i(q) in bins [0, 0.1) through [0.9, 1]; y-axis: # of queries]
Queries with a predictable goal
Human subject study results (cont’d)
[Figure: the same i(q) histogram, with the mid-range bins marked as “ambiguous queries”: 43.5% software names, 30.4% person names]
Human subject study results (cont’d)
• After removing software and person-name queries
[Figure: the i(q) histogram restricted to the remaining queries; x-axis: i(q) in bins [0, 0.1) through [0.9, 1]; y-axis: # of queries]
Human subject study: summary
• Majority of queries have predictable goals
• Interestingly, most ambiguous queries tend to be on a
certain set of topics
– Topic-based ambiguity detection may be possible
– Treat ambiguous queries differently from others
Outline
• Are query goals predictable?
– Human-subject study
• How can we predict user goals automatically?
• How effective are our features?
– Experimental evaluation
How to predict the user goal?
• “UCLA Library” vs. “Simulated Annealing”
  – Navigational vs. informational
  – Is semantic analysis necessary?
• Our idea: use information provided implicitly by Web
users
– Web-link structure
– User-click behavior
Web-link structure
• Anchor-link distribution to quantify the link structure
[Figure: many pages link with the anchor text “UCLA Library”; most point to www.library.ucla.edu, a few to www.ucla.edu/library.html and repositories.cdlib.org/uclalib/]
Web-link structure (cont’d)
• Anchor-link distribution to quantify the link structure
[Figure: anchor-link distribution for the query “UCLA Library”; x-axis: anchor link rank (1–10); y-axis: frequency of each link destination; destinations include www.library.ucla.edu, www.ucla.edu/library.html, repositories.cdlib.org/uclalib/]
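The anchor-link distribution can be sketched as follows: count how many links carry the query as their anchor text per destination URL, then normalize the counts into frequencies sorted by rank. The toy link data below is illustrative, not from the paper:

```python
from collections import Counter

def anchor_link_distribution(anchor_links, query):
    """Frequency of each link destination among links whose anchor
    text matches `query`, sorted by anchor-link rank (most-linked first)."""
    counts = Counter(
        dest for text, dest in anchor_links if text.lower() == query.lower()
    )
    total = sum(counts.values()) or 1  # avoid division by zero for unseen queries
    return [(dest, n / total) for dest, n in counts.most_common()]

# Toy link data: (anchor text, destination URL) pairs.
links = (
    [("UCLA Library", "www.library.ucla.edu")] * 8
    + [("UCLA Library", "www.ucla.edu/library.html")]
    + [("UCLA Library", "repositories.cdlib.org/uclalib/")]
)
dist = anchor_link_distribution(links, "UCLA Library")
# dist[0] is ("www.library.ucla.edu", 0.8): the rank-1 destination
# dominates, as expected for a navigational query.
```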
Anchor-link distribution for sample queries
[Figure: anchor-link distributions for “UCLA Library” (navigational) and “Simulated Annealing” (informational); x-axis: anchor link rank (1–10); y-axis: frequency of each link destination]
User-click behavior
• Click distribution to quantify past user-click behavior
[Figure: click distribution for the navigational query “UCLA Library”; x-axis: answer rank (1–10); y-axis: click frequency]
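The click distribution can be computed analogously from a search-engine click log: tally the rank of the clicked result across issuances of the query, then normalize. The log format here is an assumption for illustration:

```python
from collections import Counter

def click_distribution(click_log, query, max_rank=10):
    """Fraction of clicks falling on each answer rank (1..max_rank)
    for the given query."""
    ranks = Counter(r for q, r in click_log if q == query)
    total = sum(ranks.values()) or 1  # avoid division by zero for unseen queries
    return [ranks.get(r, 0) / total for r in range(1, max_rank + 1)]

# Toy log: (query, clicked answer rank) pairs.
log = [("ucla library", 1)] * 9 + [("ucla library", 2)]
dist = click_distribution(log, "ucla library")
# dist[0] == 0.9: clicks concentrate on the rank-1 answer,
# the signature of a navigational query.
```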
User-click behavior (cont’d)
[Figure: click distributions for “UCLA Library” (navigational) and “Simulated Annealing” (informational); x-axis: answer rank (1–10); y-axis: click frequency]
Capturing the “shape” of distributions
• Possible numeric features for f(x)
  – Mean: μ = ∫ x f(x) dx
  – Median
  – Skewness: ∫ (x − μ)³ f(x) dx / σ³
    • How “asymmetric” f(x) is
  – Kurtosis: ∫ (x − μ)⁴ f(x) dx / σ⁴
    • How “peaked” f(x) is
• Single linear regression
– Median is the most effective measurement for both anchor-
link distribution and click distribution
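For a discrete rank distribution (frequencies over ranks 1..n), these statistics can be sketched with plain moment formulas; the paper's exact estimators are not spelled out here, so this is only an illustration:

```python
def shape_features(freqs):
    """Mean, median, skewness, and kurtosis of a distribution over
    ranks 1..n, where freqs[i] is the mass at rank i + 1."""
    ranks = range(1, len(freqs) + 1)
    mean = sum(x * f for x, f in zip(ranks, freqs))
    var = sum((x - mean) ** 2 * f for x, f in zip(ranks, freqs))
    sigma = var ** 0.5  # assumes a non-degenerate distribution (sigma > 0)
    # Median: smallest rank where the cumulative mass reaches 0.5.
    cum, median = 0.0, len(freqs)
    for x, f in zip(ranks, freqs):
        cum += f
        if cum >= 0.5:
            median = x
            break
    skew = sum((x - mean) ** 3 * f for x, f in zip(ranks, freqs)) / sigma ** 3
    kurt = sum((x - mean) ** 4 * f for x, f in zip(ranks, freqs)) / sigma ** 4
    return mean, median, skew, kurt

# A sharply peaked, navigational-looking distribution: the median is 1
# and the long right tail gives a positive skewness.
mean, median, skew, kurt = shape_features([0.9, 0.05, 0.03, 0.02])
```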
Evaluation of features
• Based on 30 queries from the human subject study
– Except software and person-name queries
– Each query is associated with a distinct user goal
• Anchor-link distribution for each query
– Based on 60M pages crawled from the Web
• Click distribution for each query
– Based on Google-result click behavior from UCLA CS,
April 2004 to September 2004
Goal-prediction graph (synthetic)
[Figure: a hypothetical effective feature plotted against i(q) (0.0–1.0); such a feature would cleanly separate navigational from informational queries]
Prediction graph: median of anchor-link dist.
[Figure: median of anchor-link distribution plotted against i(q) (0.0–1.0); the threshold θ1 = 1.0 separates navigational from informational queries]
• Navigational iff median < θ1 = 1.0
  – Navigational queries: the vast majority of links point to the #1 anchor destination
• Prediction accuracy: 80.0%
Prediction graph: combining the two features
[Figure: median of click distribution + median of anchor-link distribution plotted against i(q) (0.0–1.0); threshold θ1 + θ2 = 2.0]
• Linear combination with equal weights: navigational iff
  median of click dist. + median of anchor-link dist. < θ1 + θ2 (= 2.0)
• Prediction accuracy: 90%
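The combined rule can be sketched as a small threshold classifier. The thresholds θ1 = θ2 = 1.0 come from the slides; the interpolated-median estimator (treating rank r as the unit interval [r - 1, r)) is an assumption here, since a plain discrete median of a rank-1-peaked distribution would never fall below 1.0:

```python
def interpolated_median(freqs):
    """Interpolated median of a distribution over ranks 1..n, treating
    rank r as the unit interval [r - 1, r).  One plausible reading of
    the slides' median; the exact estimator is an assumption."""
    cum = 0.0
    for rank, f in enumerate(freqs, start=1):
        if f > 0 and cum + f >= 0.5:
            return (rank - 1) + (0.5 - cum) / f
        cum += f
    return float(len(freqs))

THETA1, THETA2 = 1.0, 1.0  # per-feature thresholds from the slides

def predict_goal(click_dist, anchor_dist):
    """Equal-weight linear combination of the two medians."""
    score = interpolated_median(click_dist) + interpolated_median(anchor_dist)
    return "navigational" if score < THETA1 + THETA2 else "informational"

peaked = [0.9, 0.05, 0.03, 0.02]  # mass concentrated on the rank-1 answer
flat = [0.1] * 10                 # mass spread evenly over ranks 1..10
# predict_goal(peaked, peaked) -> "navigational"
# predict_goal(flat, flat)     -> "informational"
```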
Comparison with previous work
• Three features in [Kang and Kim 03]
  (1) Anchor usage rate
  (2) Query term distribution
  (3) Term dependence
[Figure: query term distribution plotted against i(q) (0.0–1.0) for navigational and informational queries]
• Result
  – Could not reproduce the reported results
  – The three features were not very effective
Summary
• Two effective features for goal identification
– Anchor-link distribution (Web-link structure) and click
distribution (user-click behavior)
– Achieved an overall accuracy of 90% on a benchmark
query set
• More details in the paper
Future work
• Evaluate on a larger and less biased query set
• Handle queries with insufficient anchor/click statistics
– Learn patterns from queries whose goals are clear
• Predict search intentions on a finer granularity
– Informational queries can be further classified, e.g., directed,
undirected, advice, list, etc. [Rose04]
– Analyze the contents of Web pages that users have
clicked/viewed
– Linguistic methods
Questionnaire design
• 1st version: direct classification by subjects
– Navigational vs. informational
– Some confusion
• “Alan Kay”: home page + other pages
• “Have a site in mind?” vs “plan to visit one site?”
• 2nd version:
1. Have a site in mind. Intend to visit only that site
2. Have a site in mind. But willing to visit others
3. Have no site in mind. Willing to visit anything relevant