Automatic Identification of User Goals in Web Search
Uichin Lee, Zhenyu Liu, Junghoo Cho
Computer Science Department, UCLA
{uclee, vicliu, cho}@cs.ucla.edu
Motivation
• Users have different goals for Web search
– Reach the homepage of an organization (e.g., UCLA)
– Learn about a topic (e.g., simulated annealing)
– Download online music, etc.
• Can we identify the user goal for a Web search
automatically?
– e.g., improve and customize search results based on the identified user goal
Two high-level user-goals
• Navigational query
– Reach a Web site the user already has in mind (e.g.,
“UCLA Library”)
• Informational query
– Visit multiple sites to learn about a particular topic (e.g.,
“Simulated Annealing”)
• Based on [Broder02, Rose&Levinson04]
– Navigational and informational are common in both studies
Exploiting identified user goals
• Tailored weighting/ranking mechanisms
  – Navigational queries
    • Emphasize anchor text [Craswell01, Kang03] and URL paths [Westerveld01]
  – Informational queries
    • Emphasize page content [Kang03] and IR techniques (query expansion, relevance feedback, pseudo relevance feedback, etc.)
• Tailored result presentation
  – Informational queries
    • Clustered search results [Etzioni99, Zeng04, Kummamuru04]
• Targeted ads / answers
Outline
• Are query goals predictable?
– Human-subject study
• How can we predict user goals automatically?
– Anchor-link distribution
– User-click distribution
• How effective are our features?
– Experimental evaluation
Are query goals “predictable”?
• Search engines “see” only a few keywords
– No explicit indication of goals by users
– Can we predict the user goal simply from the keywords?
• Human subject study
– 50 most popular Google queries from UCLA CS
– 28 participants (grad students) from UCLA CS
– Ask subjects to indicate the likely goal of each query if
they had issued it
• Do most subjects agree on a particular goal?
Human subject study results
• i(q) – the fraction of participants who judge query q as informational
  – e.g., i(q) = 0.038 for “UCLA Library”
[Figure: histogram of i(q) over the 50 queries; x-axis: i(q) in bins [0, 0.1) through [0.9, 1]; y-axis: # of queries]
Queries with a predictable goal
Human subject study results (cont’d)
[Figure: the same i(q) histogram, with the mid-range bins marked as “ambiguous queries”: 43.5% software names, 30.4% person names]
Human subject study results (cont’d)
• After removing software and person-name queries
[Figure: the i(q) histogram restricted to the remaining queries; x-axis: i(q) in bins [0, 0.1) through [0.9, 1]; y-axis: # of queries]
Human subject study: summary
• Majority of queries have predictable goals
• Interestingly, most ambiguous queries tend to be on a
certain set of topics
– Topic-based ambiguity detection may be possible
– Treat ambiguous queries differently from others
Outline
• Are query goals predictable?
– Human-subject study
• How can we predict user goals automatically?
• How effective are our features?
– Experimental evaluation
How to predict the user goal?
• “UCLA Library” vs. “Simulated Annealing”
  – Navigational vs. informational
  – Is semantic analysis necessary?
• Our idea: use information provided implicitly by Web
users
– Web-link structure
– User-click behavior
Web-link structure
• Anchor-link distribution to quantify the link structure
[Figure: many pages link with the anchor text “UCLA Library”; most point to www.library.ucla.edu, a few to www.ucla.edu/library.html and repositories.cdlib.org/uclalib/]
Web-link structure (cont’d)
• Anchor-link distribution to quantify the link structure
[Figure: anchor-link distribution for the query “UCLA Library”; x-axis: anchor link rank (1–10); y-axis: frequency of each link destination; destinations include www.library.ucla.edu, www.ucla.edu/library.html, repositories.cdlib.org/uclalib/]
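The anchor-link distribution can be sketched as follows: count how many links carry the query as their anchor text per destination URL, then normalize the counts into frequencies sorted by rank. The toy link data below is illustrative, not from the paper:

```python
from collections import Counter

def anchor_link_distribution(anchor_links, query):
    """Frequency of each link destination among links whose anchor
    text matches `query`, sorted by anchor-link rank (most-linked first)."""
    counts = Counter(
        dest for text, dest in anchor_links if text.lower() == query.lower()
    )
    total = sum(counts.values()) or 1  # avoid division by zero for unseen queries
    return [(dest, n / total) for dest, n in counts.most_common()]

# Toy link data: (anchor text, destination URL) pairs.
links = (
    [("UCLA Library", "www.library.ucla.edu")] * 8
    + [("UCLA Library", "www.ucla.edu/library.html")]
    + [("UCLA Library", "repositories.cdlib.org/uclalib/")]
)
dist = anchor_link_distribution(links, "UCLA Library")
# dist[0] is ("www.library.ucla.edu", 0.8): the rank-1 destination
# dominates, as expected for a navigational query.
```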
Anchor-link distribution for sample queries
[Figure: anchor-link distributions for “UCLA Library” (navigational) and “Simulated Annealing” (informational); x-axis: anchor link rank (1–10); y-axis: frequency of each link destination]
User-click behavior
• Click distribution to quantify past user-click behavior
[Figure: click distribution for the navigational query “UCLA Library”; x-axis: answer rank (1–10); y-axis: click frequency]
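The click distribution can be computed analogously from a search-engine click log: tally the rank of the clicked result across issuances of the query, then normalize. The log format here is an assumption for illustration:

```python
from collections import Counter

def click_distribution(click_log, query, max_rank=10):
    """Fraction of clicks falling on each answer rank (1..max_rank)
    for the given query."""
    ranks = Counter(r for q, r in click_log if q == query)
    total = sum(ranks.values()) or 1  # avoid division by zero for unseen queries
    return [ranks.get(r, 0) / total for r in range(1, max_rank + 1)]

# Toy log: (query, clicked answer rank) pairs.
log = [("ucla library", 1)] * 9 + [("ucla library", 2)]
dist = click_distribution(log, "ucla library")
# dist[0] == 0.9: clicks concentrate on the rank-1 answer,
# the signature of a navigational query.
```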
User-click behavior (cont’d)
[Figure: click distributions for “UCLA Library” (navigational) and “Simulated Annealing” (informational); x-axis: answer rank (1–10); y-axis: click frequency]
Capturing the “shape” of distributions
• Possible numeric features for f(x)
  – Mean: μ = ∫ x f(x) dx
  – Median
  – Skewness: ∫ (x − μ)³ f(x) dx / σ³
    • How “asymmetric” f(x) is
  – Kurtosis: ∫ (x − μ)⁴ f(x) dx / σ⁴
    • How “peaked” f(x) is
• Single linear regression
– Median is the most effective measurement for both anchor-
link distribution and click distribution
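For a discrete rank distribution (frequencies over ranks 1..n), these statistics can be sketched with plain moment formulas; the paper's exact estimators are not spelled out here, so this is only an illustration:

```python
def shape_features(freqs):
    """Mean, median, skewness, and kurtosis of a distribution over
    ranks 1..n, where freqs[i] is the mass at rank i + 1."""
    ranks = range(1, len(freqs) + 1)
    mean = sum(x * f for x, f in zip(ranks, freqs))
    var = sum((x - mean) ** 2 * f for x, f in zip(ranks, freqs))
    sigma = var ** 0.5  # assumes a non-degenerate distribution (sigma > 0)
    # Median: smallest rank where the cumulative mass reaches 0.5.
    cum, median = 0.0, len(freqs)
    for x, f in zip(ranks, freqs):
        cum += f
        if cum >= 0.5:
            median = x
            break
    skew = sum((x - mean) ** 3 * f for x, f in zip(ranks, freqs)) / sigma ** 3
    kurt = sum((x - mean) ** 4 * f for x, f in zip(ranks, freqs)) / sigma ** 4
    return mean, median, skew, kurt

# A sharply peaked, navigational-looking distribution: the median is 1
# and the long right tail gives a positive skewness.
mean, median, skew, kurt = shape_features([0.9, 0.05, 0.03, 0.02])
```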
Evaluation of features
• Based on 30 queries from the human subject study
– Except software and person-name queries
– Each query is associated with a distinct user goal
• Anchor-link distribution for each query
– Based on 60M pages crawled from the Web
• Click distribution for each query
– Based on Google-result click behavior from UCLA CS,
April 2004 to September 2004
Goal-prediction graph (synthetic)
[Figure: a hypothetical effective feature plotted against i(q) (0.0–1.0); such a feature would cleanly separate navigational from informational queries]
Prediction graph: median of anchor-link dist.
[Figure: median of anchor-link distribution plotted against i(q) (0.0–1.0); the threshold θ1 = 1.0 separates navigational from informational queries]
• Navigational iff median < θ1 = 1.0
  – Navigational queries: the vast majority of links point to the #1 anchor destination
• Prediction accuracy: 80.0%
Prediction graph: combining the two features
[Figure: median of click distribution + median of anchor-link distribution plotted against i(q) (0.0–1.0); threshold θ1 + θ2 = 2.0]
• Linear combination with equal weights: navigational iff
  median of click dist. + median of anchor-link dist. < θ1 + θ2 (= 2.0)
• Prediction accuracy: 90%
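The combined rule can be sketched as a small threshold classifier. The thresholds θ1 = θ2 = 1.0 come from the slides; the interpolated-median estimator (treating rank r as the unit interval [r - 1, r)) is an assumption here, since a plain discrete median of a rank-1-peaked distribution would never fall below 1.0:

```python
def interpolated_median(freqs):
    """Interpolated median of a distribution over ranks 1..n, treating
    rank r as the unit interval [r - 1, r).  One plausible reading of
    the slides' median; the exact estimator is an assumption."""
    cum = 0.0
    for rank, f in enumerate(freqs, start=1):
        if f > 0 and cum + f >= 0.5:
            return (rank - 1) + (0.5 - cum) / f
        cum += f
    return float(len(freqs))

THETA1, THETA2 = 1.0, 1.0  # per-feature thresholds from the slides

def predict_goal(click_dist, anchor_dist):
    """Equal-weight linear combination of the two medians."""
    score = interpolated_median(click_dist) + interpolated_median(anchor_dist)
    return "navigational" if score < THETA1 + THETA2 else "informational"

peaked = [0.9, 0.05, 0.03, 0.02]  # mass concentrated on the rank-1 answer
flat = [0.1] * 10                 # mass spread evenly over ranks 1..10
# predict_goal(peaked, peaked) -> "navigational"
# predict_goal(flat, flat)     -> "informational"
```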
Comparison with previous work
• Three features in [Kang and Kim 03]
  (1) Anchor usage rate
  (2) Query term distribution
  (3) Term dependence
[Figure: query term distribution plotted against i(q) (0.0–1.0) for navigational and informational queries]
• Result
  – Could not reproduce the reported results
  – The three features were not very effective
Summary
• Two effective features for goal identification
– Anchor-link distribution (Web-link structure) and click
distribution (user-click behavior)
– Achieved an overall accuracy of 90% on a benchmark
query set
• More details in the paper
Future work
• Evaluate on a larger and less biased query set
• Handle queries with insufficient anchor/click statistics
– Learn patterns from queries whose goals are clear
• Predict search intentions on a finer granularity
– Informational queries can be further classified, e.g., directed,
undirected, advice, list, etc. [Rose04]
– Analyze the contents of Web pages that users have
clicked/viewed
– Linguistic methods
Questionnaire design
• 1st version: direct classification by subjects
– Navigational vs. informational
– Some confusion
• “Alan Kay”: home page + other pages
• “Have a site in mind?” vs “plan to visit one site?”
• 2nd version:
1. Have a site in mind. Intend to visit only that site
2. Have a site in mind. But willing to visit others
3. Have no site in mind. Willing to visit anything relevant