Web Search Log Analysis and User Behavior Modeling: A Tutorial
ACM 17th Conference on Information and Knowledge Management
Napa Valley, CA, October 26, 2008
Peiling Wang and Lei Wu (UT)
Dietmar Wolfram (UMW)
Successful Funding
UT Interdisciplinary Research Grant: $14,600 (2001)
UT Research Grant: $5,000 (2003)
OCLC/ALISE Research Grant Award: $15,000 (2005)
IMLS National Leadership on Research: $200,000 (2005-2008)
PART I. DATA MODEL & WEB SEARCH BEHAVIOR MODELING
Three Query Corpora
Academic site:
4,597,478 queries (10/2002—01/2005)
[example log files]
Health information site:
377,701 queries (2005)
Excite search engine:
435,627 (1999); 450,199 (2001)
Server and Web Engine Logs
ACCESS.LOG
IPNum Date Time Query&Sites Machine
207.203.188.xx [06/Aug/2003:10:35:42 -0400] "GET query.html?col=utc&col=utia&…&tsi
&qt=where+is+tca+law+on+release+of+lien%…” "Mozilla/4.0 (compatible; MSIE 6.0;
Windows 9”
QUERY.LOG
Date Time Hits Sites Query
2003/08/06 10:35:42 481856 utc,utia,…,tsi u'where is tca law on release of lien?'
CLICK.LOG
Date Time Action Query Sites URL Rank
2003/08/06 10:36:33 click u’where is tca law on release of lien?’ utk,utia, …tsi
http://web.utk.edu/~ereagan1/TCA Final Exam Notes.doc 3
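The QUERY.LOG layout above (date, time, hits, sites, query) can be turned into structured records before loading it into a database. The following is only a minimal sketch: the regular expression and the function name `parse_query_line` are assumptions based on the single sample line shown, not the authors' parser, and the elided "…" in the site list is kept exactly as it appears in the sample.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the one QUERY.LOG sample line:
# date, time, hit count, comma-separated sites, then the query as u'...'
QUERY_LINE = re.compile(
    r"^(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<hits>\d+) (?P<sites>\S+) u'(?P<query>.*)'$"
)


def parse_query_line(line):
    """Return a dict for one QUERY.LOG line, or None if it does not match."""
    m = QUERY_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "timestamp": datetime.strptime(
            m["date"] + " " + m["time"], "%Y/%m/%d %H:%M:%S"
        ),
        "hits": int(m["hits"]),
        "sites": m["sites"].split(","),
        "query": m["query"],
    }


# The sample line from the slide, including its elided site list.
sample = "2003/08/06 10:35:42 481856 utc,utia,…,tsi u'where is tca law on release of lien?'"
record = parse_query_line(sample)
```

Lines that do not fit the assumed layout return `None`, so they can be counted and inspected rather than silently dropped.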
Methodological Notes
Natural data
Analogy to astronomer's work
New hypotheses along the way
Knowledge discovery through mining data
A bottom-up approach (requires a good data model)
Logical data model (Relational)
Several models used in query analysis:
  Question-oriented: Wolfram (2006); Baeza-Yates (2006)
  Data-driven: Jansen (2005); Wang, Berry & Yang (2003)
Granularity (low - high)
Figure 1. Data Model (tables and their attributes; * as marked in the figure)
Click: *QID, UID, Year, Month, Day, Time, TimeS, Rank
Query: *QID, Year, Month, Day, Time, TimeS, Hit, NumSite, IP, query_raw, groupID, QID_uniq
Query_Token_uniq: QID_uniq, String, Position
Query_uniq: QID_uniq, query, NumWord, NumChar, Freq_query_raw
Token_uniq: String, Length, Freq_query, Freq_word
WebPage: UID, URL, Freq
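To make the relational model concrete, here is a minimal sketch of part of Figure 1 in SQLite, using Python's standard `sqlite3` module. Table and column names follow the figure; the column types, the primary and foreign keys, and the choice of SQLite are assumptions, since the slides do not say which database system or constraints were used. The sample rows come from the QUERY.LOG and CLICK.LOG lines shown earlier.

```python
import sqlite3

# Names follow Figure 1; types and key constraints are assumptions.
SCHEMA = '''
CREATE TABLE Query_uniq (QID_uniq INTEGER PRIMARY KEY, "query" TEXT,
                         NumWord INTEGER, NumChar INTEGER, Freq_query_raw INTEGER);
CREATE TABLE "Query" (QID INTEGER PRIMARY KEY, Year INTEGER, Month INTEGER,
                      Day INTEGER, Time TEXT, TimeS INTEGER, Hit INTEGER,
                      NumSite INTEGER, IP TEXT, query_raw TEXT, groupID INTEGER,
                      QID_uniq INTEGER REFERENCES Query_uniq(QID_uniq));
CREATE TABLE WebPage (UID INTEGER PRIMARY KEY, URL TEXT, Freq INTEGER);
CREATE TABLE Click (QID INTEGER REFERENCES "Query"(QID),
                    UID INTEGER REFERENCES WebPage(UID),
                    Year INTEGER, Month INTEGER, Day INTEGER,
                    Time TEXT, TimeS INTEGER, "Rank" INTEGER);
'''

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

q = "where is tca law on release of lien?"
url = "http://web.utk.edu/~ereagan1/TCA Final Exam Notes.doc"

# One unique query, one submitted query, one page, one click (from the sample logs).
conn.execute("INSERT INTO Query_uniq VALUES (1, ?, ?, ?, 1)",
             (q, len(q.split()), len(q)))
conn.execute('INSERT INTO "Query" (QID, Year, Month, Day, Time, Hit, query_raw, QID_uniq) '
             "VALUES (1, 2003, 8, 6, ?, 481856, ?, 1)", ("10:35:42", q))
conn.execute("INSERT INTO WebPage (UID, URL, Freq) VALUES (1, ?, 1)", (url,))
conn.execute('INSERT INTO Click (QID, UID, Year, Month, Day, Time, "Rank") '
             "VALUES (1, 1, 2003, 8, 6, ?, 3)", ("10:36:33",))

# Which result (and at what rank) was clicked for each submitted query?
rows = conn.execute(
    'SELECT q.query_raw, w.URL, c."Rank" FROM Click AS c '
    'JOIN "Query" AS q ON q.QID = c.QID '
    "JOIN WebPage AS w ON w.UID = c.UID"
).fetchall()
```

Keeping raw queries (`Query`) separate from unique queries (`Query_uniq`) is what lets frequency counts and per-submission timestamps coexist without duplication.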
Figure 2. Lexicon Tool (WordNet implemented in a relational database; tables and their attributes)
sense: *word, meaningID
word: *word, num_meaningID
synset: *meaningID, definition
morphref: word, morph, pos
Stop_word: S_Word
Modeling behaviors
Corpora-based (site users as a whole):
  popular queries
  information needs
  document access (clicked URLs)
  query characteristics
  word co-occurrence
Session-based (individual searches):
  interactions (reiteration of queries …)
  search topics
  clustering sessions to identify patterns
Top Queries
UTK                        HealthLink                     Excite
BlackBoard          74316  urinary+tract+infection  3947  yahoo       2523
Enter search terms  45564  pregnancy                2788  sex         2258
housing             45270  breast+cancer            2443  horoscopes  1249
circle park         32552  diet                     2297  hotmail     1121
registrar           28284  interstitial+cystitis    2271  maps        1100
tuition             26312  (blank)                  2144  weather      963
career services     21327  blood+clots              1952  games        943
bookstore           20507  breast+self-exam         1867  ebay         918
timetable           20207  shingles                 1844  porn         861
transcripts         19436  bulemia                  1740  las+vegas    840
A problem identified: "Enter search terms" is the prompt text of the search box (shown beside the Go button), yet it ranks second among UTK queries.
Seasonal Information Needs
Figure 3. "football" Related Query Distribution (x-axis: Month; y-axis: Percentage; series: 2003, 2004)
Top 10 Words
UTK             HealthLink      Excite
of       54669  of        4629  and       97967
and      35547  and       3096  of        26434
for      26849  in        2813  the       17711
the      18986  the       2058  in        15368
in       16900  for       1893  free      15136
student  15604  disease   1706  for       10411
ut       14201  cancer    1315  to         7024
2003     12975  blood     1259  pictures   7000
2004     12683  to        1255  new        6991
to       11410  syndrome  1235  nude       5951
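Corpus-level rankings like the top queries and top words can be produced with simple frequency counts. This is a minimal sketch using Python's `collections.Counter`; the function names, the lowercasing, the whitespace tokenization, and the three sample queries are all illustrative assumptions, not the normalization actually applied to the three corpora.

```python
from collections import Counter


def top_queries(queries, n=10):
    """Most frequent whole queries (case-folded; an assumed normalization)."""
    return Counter(q.strip().lower() for q in queries).most_common(n)


def top_words(queries, n=10):
    """Most frequent words across all queries, split on whitespace."""
    words = Counter()
    for q in queries:
        words.update(q.lower().split())
    return words.most_common(n)


# Hypothetical queries, for illustration only.
sample_queries = ["UT football schedule", "football tickets", "ut housing"]
query_ranking = top_queries(sample_queries)
word_ranking = top_words(sample_queries, n=2)
```

No stop-word filtering is applied here, which is consistent with function words such as "of" and "and" leading the word lists.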
What can a search session tell?
Information needs (queries):
  representation of such needs
  cognitive (knowledge structure)
  topics searched
  linguistic features
Interactions (http://aquamarine.sis.utk.edu/top/session_ex.htm):
  moves from an initial query to subsequent queries
  clicks (viewing results or accessing documents)
  search strategies
Information Needs and Knowledge Structure
Figure 4. Drawing a Semantic Network of a Single Search Session (example: 2004 University of Tennessee Football Schedule; nodes: 2004, football, Schedule, game, UT; numeric labels in the figure: 9, 6, 7, 3, 5, 1, 11)
Building Semantic Networks
Corpus-based or session-based
Word co-occurrence for links
Use an algorithm to set a threshold for the boundaries of the semantic network
The threshold may vary for different clusters
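A minimal sketch of the co-occurrence step: count how often two words appear in the same query and keep only links at or above a threshold. The function name `cooccurrence_links`, the sample queries, and the single fixed threshold are assumptions for illustration; the slides call for an algorithm that sets the threshold, possibly differently per cluster, and that algorithm is not specified here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_links(queries, threshold=2):
    """Return {(word_a, word_b): count} for pairs co-occurring in >= threshold queries."""
    pair_counts = Counter()
    for q in queries:
        # Sort the unique words so each pair has one canonical ordering.
        words = sorted(set(q.lower().split()))
        pair_counts.update(combinations(words, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= threshold}


# Hypothetical queries from one session, for illustration only.
session_queries = [
    "ut football schedule",
    "2004 football schedule",
    "ut football game",
]
links = cooccurrence_links(session_queries, threshold=2)
```

The surviving pairs are the edges of the network and the counts can serve as edge weights; raising the threshold shrinks the network toward its most strongly connected words.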
PART II. QUANTITATIVE CLUSTERING & QUALITATIVE CLUSTERING
Challenge: Identifying Web Search Sessions
Server-side transaction logs
Queries from different IPs interweaving
Dynamic IP address
Shared computers
Session boundaries are unidentifiable
Sessions for analysis of interactions
Defining Search Sessions
An artificial boundary
A set of consecutive queries submitted from the same identifier (IP address, cookie, user account) within a reasonable time interval (cutoff value)
Session boundaries depend on the cutoff value (threshold)
What is a reasonable cutoff?
Experimenting with the Cutoff
Query interval ($\Delta t_i$) is the time difference (also called time lag) between two consecutive queries from the same identifier:
$\Delta t_i = T(q_{i+1}) - T(q_i)$, where $T(q_i)$ is the timestamp of the $i$th query
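The session definition can be sketched as follows: sort queries by identifier and time, then start a new session whenever the interval between consecutive queries from the same identifier exceeds the cutoff. The function name, the record layout, the sample records, and the 30-minute default are assumptions for illustration; the slides treat the cutoff as something to be determined experimentally and do not state a value.

```python
from datetime import datetime, timedelta
from itertools import groupby


def split_sessions(records, cutoff=timedelta(minutes=30)):
    """Group (identifier, timestamp, query) records into sessions.

    A new session starts when the gap between two consecutive queries
    from the same identifier is greater than `cutoff` (an assumed default).
    """
    sessions = []
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    for _, group in groupby(ordered, key=lambda r: r[0]):
        current = []
        for rec in group:
            if current and rec[1] - current[-1][1] > cutoff:
                sessions.append(current)
                current = []
            current.append(rec)
        sessions.append(current)
    return sessions


# Hypothetical records: two identifiers with interleaved queries.
day = datetime(2003, 8, 6)
records = [
    ("a", day.replace(hour=10, minute=0), "tuition"),
    ("b", day.replace(hour=10, minute=1), "housing"),
    ("a", day.replace(hour=10, minute=5), "tuition deadline"),
    ("a", day.replace(hour=11, minute=0), "bookstore"),
]
sessions = split_sessions(records)
```

Rerunning `split_sessions` with different `cutoff` values and comparing the resulting session counts is one simple way to explore how sensitive the boundaries are to the threshold.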
Figure 5. Experimenting with Different Cutoffs (HealthLink dataset)
Session variables (Means)
Length (size): number of queries
Query length: number of terms
Term popularity: corpus-based term F
Query interval: time lag between two consecutive queries
(Duration: time lag between first and last queries)
Term reuse: session-based term f
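The session variables listed above could be computed per session roughly as follows. This is a sketch under several assumptions: timestamps are given in seconds, every query contains at least one term, term popularity is taken as the mean corpus frequency F over the session's term occurrences, and term reuse is taken as the mean within-session frequency f per distinct term. The slides name the variables but do not give these exact formulas, and the function name and sample data are hypothetical.

```python
from collections import Counter
from statistics import mean


def session_variables(session, corpus_freq):
    """Compute per-session means.

    session: list of (timestamp_in_seconds, query), sorted by time.
    corpus_freq: dict mapping term -> corpus frequency F.
    """
    times = [t for t, _ in session]
    queries = [q.lower().split() for _, q in session]
    terms = [w for q in queries for w in q]
    term_f = Counter(terms)  # session-based term frequency f
    intervals = [b - a for a, b in zip(times, times[1:])]
    return {
        "length": len(session),
        "query_length": mean(len(q) for q in queries),
        "term_popularity": mean(corpus_freq.get(w, 0) for w in terms),
        "query_interval": mean(intervals) if intervals else 0,
        "duration": times[-1] - times[0],
        "term_reuse": mean(term_f.values()),
    }


# Hypothetical session and corpus frequencies, for illustration only.
session = [
    (0, "football"),
    (40, "ut football schedule"),
    (100, "ut football schedule 2004"),
]
corpus_freq = {"football": 10, "ut": 20, "schedule": 5, "2004": 1}
variables = session_variables(session, corpus_freq)
```

Each session then becomes one record of numeric variables, which is the form needed for export to a clustering tool.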
Figure 6 Visualize Clusters
Figure 6 interpreted
C1 “hit and run”: brief sessions, short query intervals, few terms, less popular terms
C2 “focused search”: long queries; popular vocabulary
C3 “struggling search”: long sessions, long query intervals, re-use of terms in subsequent queries
Clustering: TwoStep Method
1. Session variables (see above) are exported from the database as a delimited file. Each session is represented as a record.
2. The raw data are imported into SPSS.
3. TwoStep cluster analysis is run (with standardized data).
Session Raw Data and Normalization
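Because the session variables are on very different scales (counts, seconds, frequencies), they are standardized before clustering. The sketch below shows z-score standardization per column in plain Python. It is an assumption that z-scores are the normalization intended here, and the use of the population standard deviation is a further assumption; the function name and the sample rows are hypothetical, and this is not a reimplementation of the SPSS procedure.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: (value - column mean) / column standard deviation.

    Uses the population standard deviation (an assumption); a constant
    column is mapped to 0.0 to avoid dividing by zero.
    """
    columns = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [
        [(v - m) / s if s else 0.0 for v, (m, s) in zip(row, stats)]
        for row in rows
    ]


# Hypothetical raw session records: (length, duration in seconds).
raw = [[1, 10], [3, 10]]
normalized = standardize(raw)
```

After standardization each variable contributes on a comparable scale to the distance computations that clustering relies on.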
Cluster validation
Divide each dataset into two or more subsets of sessions to determine whether similar clustering outcomes are produced
Longitudinal samples: each academic year consists of three quarters (Fall, Spring, Summer)
Beyond Quantitative Clustering: Conceptual Analysis
Conceptual level: synonyms, association (mutual information)
Semantic level: hypernym-hyponym (hyper-hypo) relationship
May also include structure level
Clustering User Queries
different queries may represent the same or similar information needs
a set of queries may look for the same information
clustering based on similarity (distance):
  word level (symbolic, morph)
  concept level (synonym, association)
  semantic level (hierarchical relationship)
Similarity scores
Simw = ∑ TokenScore / Max(Q1(UniqueWords), Q2(UniqueWords))
Simc = ∑ ConceptScore / Max(Q1(UniqueWords), Q2(UniqueWords))
Sims = ∑ SemanticScore / Max(Q1(UniqueWords), Q2(UniqueWords))
Wq1       Wq2        Token  Concept  Semantic
class     course     0      1        1
class     timetable  0      0        0
schedule  course     0      0        0
schedule  timetable  0      0        0.5
Sim                  0      0.5      0.75
Similarity at Three Levels
Q1: class schedule
Q2: timetable of courses
At the conceptual level, we may use synonyms as well as normalized word association values; thus the three pairs (class, timetable; schedule, course; schedule, timetable) need not be scored 0.
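The three similarity formulas share one shape: sum the pairwise scores and divide by the larger unique-word count of the two queries. The sketch below reproduces the table's totals for Q1 and Q2. The function name is hypothetical, the pair scores are copied from the table, and dropping "of" from Q2 as a stop word is an assumption (it is what makes the denominator 2 and matches the table's 0, 0.5, 0.75).

```python
def similarity(pair_scores, q1_words, q2_words):
    """Sum of pairwise scores / Max(unique words in Q1, unique words in Q2)."""
    denominator = max(len(set(q1_words)), len(set(q2_words)))
    return sum(pair_scores) / denominator


q1 = ["class", "schedule"]
q2 = ["timetable", "courses"]  # "of" removed as a stop word (assumption)

# Scores for the pairs (class, course), (class, timetable),
# (schedule, course), (schedule, timetable), taken from the table.
token_scores = [0, 0, 0, 0]
concept_scores = [1, 0, 0, 0]
semantic_scores = [1, 0, 0, 0.5]

sim_w = similarity(token_scores, q1, q2)      # word (token) level
sim_c = similarity(concept_scores, q1, q2)    # concept level
sim_s = similarity(semantic_scores, q1, q2)   # semantic level
```

The two queries share no tokens, so word-level matching alone would treat them as unrelated; the concept and semantic levels are what bring them together.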
WordNet and beyond
A useful tool with limitations
Expansion of vocabulary is needed to include local vocabulary
Hierarchical relationship needs improvement
Incorporate associative relationship
Thank you!
Questions?