Tree-Based Ontological User Profiling for Web Search Optimization

Sachin R. Joglekar¹ and Dr. Mangesh Bedekar²
¹ BITS-Pilani, K. K. Birla Goa Campus, Goa, India
Email: [email protected]
² Maharashtra Institute of Technology, Kothrud, Pune, India
Email: [email protected]
Abstract--Today, the internet represents the most important source of knowledge for the average user. As the number of users turning to search engines for information increases, web search personalization is becoming an important domain of research in information retrieval. It is essential to adapt the process of web search to the needs of every individual, without explicit actions on the user's side. In this paper, we demonstrate an implicit client-side method for user profiling using an ontological tree, for efficient contextualization of a user's interests. Our system builds 'part-profiles' of a user over time from his web usage data, each pertaining to one domain of interest of the said user. We also demonstrate the usage of these profiles for search result re-ranking by means of a tree-traversing algorithm, to ensure faster knowledge gain from the user's perspective. The methods described in this paper can also be applied to other forms of web search optimization, such as query expansion.
Index Terms--User Profiling, Web Search, Ontology
I. INTRODUCTION
A knowledge worker can be defined as someone
“whose paid work involves significant time spent in
gathering, finding, analyzing, creating, producing or
archiving information” [13]. Essentially, every web user today is a knowledge worker, using search engines extensively to obtain required knowledge from the web. Considering the ever-increasing
amount of text data being added to the internet, it is
important to make it easier for the user to get to the
information he wants, as efficiently as possible. With this
aim in mind, the one-size-fits-all approach [14] is no
longer recommended, especially considering the
polysemous nature of words in the English vocabulary.
The need to model the web experience of an individual as
per his specific interests is more important than ever.
The best possible way to understand what a user needs
from his web experience is relevance feedback for web
pages from the user himself [4]. However, a user is usually reluctant to make the extra effort needed to provide such information explicitly on a regular basis
[15]. Hence, implicit user modeling is required, to avoid
the extra hassle from the user's side. Moreover, fast
processing of a user's interests will only be possible if it
is achieved at the individual client-side, helping
distribute the necessary computations instead of
burdening the server-side infrastructure.
To achieve the aim of implicit client-side search
optimization, we take the help of the DMOZ
(directory.mozilla.org) ontology. An ontology is defined as “an explicit specification of a conceptualization: the
objects, concepts, and other entities that are assumed to
exist in some area of interest and the relationships that
hold among them” [6, 16]. With the idea of accurately
representing a user's information needs and search
interests, we propose a system to implicitly build an
ontological tree of 'part-profiles'. The ontology used for
this purpose is the DMOZ/ODP (Open Directory Project)
ontology [1]. Every part-profile, as constructed by this
method, would denote an area of the user's interest and
help in optimizing web search pertaining to it.
The aforementioned tree of ontological part-profiles
will be updated dynamically, as the user visits various
web pages. Such dynamic generation of user profiles has
been explored in the UCAIR system [3]. However,
UCAIR requires the definition of 'logical sessions' of a
user's web activities, where every distinct session would
pertain to a unique profile. Our method overcomes this
shortcoming by extracting important keywords from
every web page viewed by a user, and exploiting the
ODP hierarchy to understand the field of study of a web
search [12]. The keywords extracted from a certain web
page form a weighted vector in the bag-of-words
representation of the page, which is then categorized into
one of the user's profiles. Ref. [17] explored a similar
method for ontological categorization of visited pages.
We improve upon their method by using a tree-based
algorithm over the profile tree for fast categorization.
This avoids the need to compare a certain page vector
with every category in the tree. When the similarity of a
page vector to its predicted category falls below a pre-
defined threshold, a new part-profile is initialized for the
user.
Personalization of web search is attained by
constructing an expanded version of the user's original
query vector using the summaries of the top search
results from a search engine. This vector is then either
classified to a part profile of the user (if the similarity is
above the threshold) or added to a newly generated
profile based on the DMOZ ontology. Then, result re-
ranking is done on the top search results [18] by
computing their final rank based on an algorithm inspired
by [11].
The remaining sections are organized as follows. In Section 2, we discuss related work. In Section 3, we present our complete methodology for dynamic user profiling using the ODP ontology tree. In Section 4, we demonstrate the experimental results obtained by implementing the framework described in Section 3. Section 5 presents our conclusions, a discussion of the shortcomings of our approach, and directions for future work.
II. RELATED WORK
Ref. [2] very appropriately defines the two pillars of personalized search: contextualization and individualization. Contextualization refers to the definition and representation of the background information regarding a user's work and the nature of his search interests. Individualization refers to the distinguishing factors and data pertaining to a user's own unique information needs. Usage of an ontology for user profiling satisfies both these requirements: the extensive vocabulary of DMOZ aids adequate representation of a user's background (contextualization), while the hierarchy of domains in the category tree helps in accurate descriptions of the topics of interest (individualization). Ref. [9] recognized the four main types of context in web applications: domain, location, data, and user. With the help of our system, we exploit
data context (mining the ODP wealth of knowledge for
content) and user context (focusing on the web pages
viewed by the user). Ref. [2] uses the ODP ontology to
classify every page clicked by the user into one of its
categories, leading to the construction of individual-
domain profiles, similar to our explained work. However,
their method differs from ours in the algorithms followed
for keyword extraction and categorization.
Ref. [3] proposed the development of UCAIR - a
decision-theoretic framework for implicit user modeling
based on query expansion and click-through information.
As search engine queries are usually short, the user
models based on them are understandably impoverished.
Hence, query expansion is utilized to enrich the notion of
what the user desires. The need to perform the user
modeling at the client side is also stressed, to reduce the
server load drastically [3]. However, as mentioned
before, to exploit previous queries and the corresponding
click-through data, UCAIR needs to judge whether two
adjacent queries belong to the same logical session. Our
framework overcomes this hurdle by making use of the
DMOZ ontology to define the 'topic' of a search session,
and extract keywords from every document to
automatically group together information from various
visited web pages. Moreover, UCAIR focuses on user
modeling based on short-term contexts, while our
framework remembers the user's interests over a long
period of time.
Ref. [19] evaluates the various ontology-based
methods to model user interests – primarily by changing
the type of inputs provided for mapping interests onto the
ODP ontology. The two best inputs to consider while
understanding the user's information needs are (i) the text
content of the web pages dwelled upon and frequently
visited by the user, and (ii) the queries input by the user,
in an expanded format [19]. We utilize the page content
data for building the part profiles and the expanded-query
approach to aid search result re-ranking. Ref. [20] suggested building a user profile comprising data from previous search query terms. However, since
the interpretation of a query by a search engine may be
erroneous, we focus on web pages viewed by a user,
weighted by the amount of attention given to each of
them.
A reasonably good method to expand a search query
is to derive additional terms from the summaries of the
top 10-50 search results, depending on the available
resources as explained in [3, 19]. Query expansion not
only resolves the problem of poor vocabulary of the
original query, but also avoids the issues occurring due to
word mismatch [19].
For categorizing a vector using the DMOZ category
tree, the OBIWAN system maps every visited web page
to five different categories [5]. Categorization is done by
comparing a test vector with the vector corresponding to
every single category [17]. Our system improves upon
this by using a tree-traversing algorithm to reduce the
number of comparisons and make the system more
efficient. To improve accuracy, we classify every page to
only one unique category. Cosine similarity is generally
used while mapping a page or query vector onto the
appropriate ODP category [3, 11, 17, 19]. In this work, we use a modified version of the Tanimoto coefficient, also
known as the extended Jaccard coefficient [22], as an
indicator of vector-vector similarity.
The method usually used for re-ranking search results is sorting in descending order of cosine similarity. UCAIR utilizes expanded queries and viewed
document summaries to dynamically re-rank unseen
search results [3]. Ref. [11] proposed to compute final
rank of every search result as a linear combination of the
contextual rank (computed using cosine similarity with
context vector) and the keyword rank (original rank
provided by the search engine). This allows flexibility in assigning weightage to both types of ranks to get a 'hybrid' rank, focusing on keyword as well as contextual
similarity.
III. THE PROFILING MODEL
A. Single Document Keyword Extraction
We focus on the extraction of keywords from every
web page/document read by the user instead of grouping
together web pages into sessions. To achieve this aim, we
use a modified version of the single-document keyword
extraction algorithm proposed in [12]. Ref. [10]
summarizes the five groups of keyword weighing
methods - (i) a word which appears in a document is
likely to be an index term, (ii) a word which appears
frequently in a document is likely to be an index term,
(iii) a word which appears only in a limited number of
documents is likely to be an index term for these
documents, (iv) a word which appears relatively more
frequently in a document than in the whole database is
likely to be an index term for that document, and (v) a
word which shows a specific distributional characteristic
in the database is likely to be an index term for the
database. Ref. [12] focuses on keywords which show a
specific distributional characteristic in the database. This
is done by weighing them according to the degree of bias
of their co-occurrences, with the frequently appearing
keywords in the document. The co-occurrence bias can
be measured quantitatively by calculating the statistical
value of χ²(w) as an index of bias [12]. The said value is given by the equation

χ²(w) = Σ_{g ∈ G} (freq(w, g) − n_w p_g)² / (n_w p_g) ,    (1)

where w is a certain keyword, g is a keyword from G, the set of most commonly occurring keywords, freq(w, g) denotes the number of co-occurrences of keywords w and g, n_w denotes the number of keywords present in the sentences in which w occurs, and p_g denotes the percentage of keywords that occur in sentences in which
g is found. We construct G by considering the top one
third of most frequently occurring keywords in a given
page. Ref. [12] suggested using only G for co-occurrence
measures, since only the frequently occurring words in a
document strongly determine the occurrence
distributions.
In (1), n_w p_g denotes the 'expected' co-occurrence count of w and g, while freq(w, g) is the actual value of this quantity. Hence, larger values of χ²(w) indicate a stronger bias in the co-occurrence of w with the keywords in G. According to
experimental results, this method proves comparable to
the popular tf-idf (term frequency-inverse document
frequency) scale for domain-independent keyword
extraction. Since the domain of knowledge a web page
belongs to will not be known beforehand, this weighing
scale is very useful in our methodology.
However, just a strong co-occurrence bias may not be
sufficient to measure the importance of a keyword in a
document. For example, words that occur very rarely in a
document may also end up having very biased co-
occurrence distributions, but may not be important in
defining its context. Therefore, we have modified the
importance index by considering the weighing methods
defined in [10], namely term frequencies. Hence, the
quantity we propose for measuring the importance of keywords in a single document is

I(w) = tf(w) · χ²(w) ,    (2)

where tf(w) denotes the term frequency of word w in the given document, and χ²(w) is given by (1).
This ensures that keywords which not only occur frequently but also show a specific distributional characteristic in the document are given the most importance. The keywords considered for this algorithm were extracted from a text document in the form of unigrams and bigrams using the Apriori algorithm [21], after preliminary pre-processing such as stop word removal followed by stemming [7]. Experimental
evidence shows that this method is very effective in
extracting keywords from single web pages, without
requiring a corpus.
We thus construct a bag-of-words representation of
any document with the weightages given to every
keyword being equal to I(w) (2).
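To make this concrete, below is a minimal sketch of the weighing scheme of (1) and (2). The function name is illustrative, tokenization is simplified to lowercase word splitting (rather than the full Apriori-based unigram/bigram extraction with stop word removal and stemming described above), and co-occurrence counts are approximated at sentence granularity.

import re
from collections import Counter

def keyword_importance(text):
    # Split into sentences, then into lowercase word tokens
    sentences = [re.findall(r"[a-z]+", s.lower())
                 for s in re.split(r"[.!?]+", text) if s.strip()]
    tf = Counter(w for sent in sentences for w in sent)
    # G: top one third of the most frequently occurring terms
    terms = [w for w, _ in tf.most_common()]
    G = set(terms[:max(1, len(terms) // 3)])
    total = sum(tf.values())
    # p_g: fraction of all terms that occur in sentences containing g
    p = {g: sum(len(s) for s in sentences if g in s) / total for g in G}
    importance = {}
    for w in tf:
        # n_w: number of terms in the sentences in which w occurs
        n_w = sum(len(s) for s in sentences if w in s)
        chi2 = 0.0
        for g in G:
            if g == w:
                continue
            freq_wg = sum(1 for s in sentences if w in s and g in s)
            expected = n_w * p[g]
            if expected > 0:
                chi2 += (freq_wg - expected) ** 2 / expected  # eq. (1)
        importance[w] = tf[w] * chi2   # I(w) = tf(w) * chi^2(w), eq. (2)
    # Normalized bag-of-words representation of the document
    norm = sum(importance.values()) or 1.0
    return {w: v / norm for w, v in importance.items()}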
B. Construction of the ODP Tree
The ODP ontology, also known as DMOZ, is the largest human-edited directory of Internet sites [1]. We
build a hierarchical tree of vectors from this ontology,
with each node denoting one category, and its children
depicting its sub-categories. Ref. [5] showed that while
generating representative vectors for DMOZ categories,
5-60 documents-worth of data provided quite accurate
results. The method we follow is similar to the one used
by [17] while generating individual category profiles, but
with an added recursive step in the end.
We first construct a tree data structure based on the
DMOZ ontology. Then, for every category, we append
all its URL descriptions together (not considering its sub-
categories). This piece of text can now be considered as a
'document' describing that particular topic. We then
apply the single-document keyword extraction algorithm
described in Section 3.A on the category document to get
a bag-of-words vector representation for it. Before
storing it into the corresponding tree node, the vector is
normalized to avoid bias towards categories with larger
number of associated URL descriptions.
After this initialization, we run a recursive algorithm to update every node's vocabulary based on that of its children. This method can be expressed as

V_N,final = normalize(V_N,initial + Σ_{n ∈ C(N)} ω(n) · V_n,final) ,    (3)

where N denotes a particular node, V_N,initial denotes the initial vector of node N constructed from URL descriptions, V_N,final denotes the final vector of N, C(N) denotes the set of its children, and ω(n) is the weightage factor for the vector of child n. For our experiments, we take ω(n) = 1.
Thus, we construct the DMOZ ontology tree used for
user profiling, with each category-node being assigned an
appropriate bag-of-words representation in the vector
space model.
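As an illustration, the following sketch builds the node vectors of (3) bottom-up over a simple tree of sparse dict-based vectors. The class and helper names are ours, not part of the ODP data, and we take ω(n) = 1 as in our experiments; the same normalize helper is reused in the sketches that follow.

class CategoryNode:
    # One DMOZ category; vector is a sparse bag-of-words dict
    def __init__(self, name, vector=None):
        self.name = name
        self.vector = dict(vector or {})   # V_N,initial from URL descriptions
        self.children = []

def normalize(vector):
    # L2 normalization (illustrative; any consistent norm works)
    norm = sum(v * v for v in vector.values()) ** 0.5
    return {w: v / norm for w, v in vector.items()} if norm else vector

def enrich(node, omega=lambda child: 1.0):
    # Bottom-up pass implementing (3): add the weighted final vectors
    # of the children to the node's initial vector, then re-normalize
    for child in node.children:
        enrich(child, omega)
        w = omega(child)
        for term, value in child.vector.items():
            node.vector[term] = node.vector.get(term, 0.0) + w * value
    node.vector = normalize(node.vector)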
C. Vector Categorization
We now describe the tree-traversing vector
categorization method used in our framework. Consider a
normalized document vector Vtest constructed using the
method as in Section 3.A. To classify it to a category of
the DMOZ ontology, we use the following algorithm, given here in the form of Python code:

# v_test is the normalized document vector V_test; Top is the root node
current_node = Top
current_similarity = Tsim(current_node.vector, v_test)
while True:
    # Stop when a leaf category is reached
    if len(current_node.children) == 0:
        break
    # Similarity of the test vector to every child category
    similarities = {}
    for child in current_node.children:
        similarities[child] = Tsim(child.vector, v_test)
    max_child = max(similarities, key=similarities.get)
    max_similarity = similarities[max_child]
    # Descend only while the similarity keeps increasing
    if max_similarity > current_similarity:
        current_node = max_child
        current_similarity = max_similarity
    else:
        break
The above algorithm, at every node, keeps traversing
to the child with the highest similarity to the test vector,
provided the similarity is greater than that with the
current node. This ensures that the tree traversal stops at
the category node with the highest similarity to the test
vector. The measure of similarity used in this algorithm
is a modified version of the extended Jaccard coefficient,
also known as the Tanimoto coefficient, given by
Tsim(V_N, V_test) = (φ(V_N, V_test) · V_test) / (|φ(V_N, V_test)|² + |V_test|² − φ(V_N, V_test) · V_test) ,    (4)

where φ(V_N, V_test) is the component of V_N restricted to the keywords that also occur in V_test:

φ(V_N, V_test) = {w | w ∈ V_N and w ∈ V_test} .    (5)
This form of the Tanimoto coefficient proves better suited for vector categorization. The basic idea is to consider only that component of the node's vector that lies along the dimensions (keywords) the test vector comprises. Since a category vector may potentially contain concepts from various other sub-domains apart from the user's interest, considering all of them in the calculation is futile.
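A minimal sketch of this similarity measure over sparse dict-based vectors follows; the function and variable names are illustrative, and the vectors map keywords to weightages as produced in Section 3.A.

def Tsim(v_node, v_test):
    # phi(V_N, V_test): component of the node vector restricted to the
    # keywords present in the test vector, eq. (5)
    phi = {w: v_node[w] for w in v_test if w in v_node}
    dot = sum(phi[w] * v_test[w] for w in phi)
    phi_sq = sum(v * v for v in phi.values())
    test_sq = sum(v * v for v in v_test.values())
    # Modified Tanimoto (extended Jaccard) coefficient, eq. (4)
    denom = phi_sq + test_sq - dot
    return dot / denom if denom else 0.0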
D. Building of Part-Profiles
Consider a user with part-profile tree P, who views a
document with normalized vector d (extracted using
method of Section 3.A). Suppose it gets classified to the
part-profile Pi, based on the algorithm discussed in
Section 3.C. It is to be noted that the classification is
done using the part-profile tree, not the original ODP
ontology tree.
If Tsim(d, V_Pi) > Ɛ, where Ɛ is a pre-defined threshold value, then the part-profile Pi is enriched using the formula

V_Pi,new = normalize(Ω(d, n_i) · d + (1 − Ω(d, n_i)) · V_Pi) ,    (6)

where n_i is the total number of web pages visited by the user pertaining to part-profile Pi, and Ω(d, n_i) is an 'importance' function of n_i and of the total time the user spends on the document. In our framework, we take Ω(d, n_i) = 1/n_i, so that recently viewed documents affect the category profile strongly enough to provide short-term contextualization. Finally, the value of n_i is incremented by 1 to record the visit to one more web page.
If Tsim(d, V_Pi) < Ɛ, a new part-profile is initialized for the user, using the original DMOZ ontology tree and the categorization algorithm presented in Section 3.C. The category vector of the new part-profile is again initialized using (6) (with n_j = 1, where j is the computed category). We empirically find the optimum value of the threshold Ɛ by considering the similarities of DMOZ categories with irrelevant test vectors.
Thus, we construct a part-profile tree for every user based on the web documents he reads or views. Initially,
the profile tree is empty. This tree grows and is enriched
based on the concepts present in the DMOZ ontology and
the data extracted from the web pages visited by the user.
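As an illustration, the following sketch implements the enrichment rule (6) with Ω(d, n_i) = 1/n_i. The PartProfile class and function names are illustrative, normalize is the helper from the Section 3.B sketch, and the decision between enriching an existing part-profile and initializing a new one (the Ɛ test above) is assumed to happen before this function is called.

class PartProfile:
    # A part-profile: a sparse category vector plus a visit count n_i
    def __init__(self, vector):
        self.vector = dict(vector)  # seeded from the matched ODP category
        self.n = 1                  # n_i = 1 on initialization

def update_part_profile(profile, d):
    # Enrich the profile with document vector d, per (6):
    # V_new = normalize(Omega * d + (1 - Omega) * V_old), Omega = 1/n_i
    omega = 1.0 / profile.n
    merged = {w: (1.0 - omega) * v for w, v in profile.vector.items()}
    for w, v in d.items():
        merged[w] = merged.get(w, 0.0) + omega * v
    profile.vector = normalize(merged)
    profile.n += 1                  # one more page visited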
E. Search Result Re-ranking
For search result re-ranking, we first employ query
augmentation on the original query to enrich it. This is
done based on the keyword-similar search results
provided by the search engine. We append the summaries
of the top 10 results yielded by the search engine into a
document, and then apply our single-document keyword
extraction algorithm on it to form an expanded query
vector q.
This query vector is categorized into the user's part-profile tree using the tree-traversing algorithm of Section 3.C. Let the computed part-profile be Pi. If Tsim(q, V_Pi) > Ɛ, then the categorical vector used for result re-ranking is given as

V_C = normalize(q + V_Pi) .    (7)

If Tsim(q, V_Pi) < Ɛ, a new part-profile is generated using the original ODP tree, as in Section 3.D, and V_C is again calculated using (7).
Let the bag-of-words vector representation of the i-th search result be V_i, where i is the keyword rank of the particular search result, given by the search engine. The 'category ranks' of the search results are calculated by sorting them in decreasing order of Tsim(V_i, V_C). Let the category rank of the i-th search result be j. We calculate the resultant rank of this page by computing a linear combination of the keyword rank and the category rank [11]. Thus, the final rank R is given by

R = (1 − α) · i + α · j ,    (8)

where α is the weightage given to the category rank, and 0 < α < 1. We take α to be 0.6, since category ranks usually prove more relevant than keyword ranks in result re-ranking [11].
The process of result re-ranking can be made dynamic
by updating the user part-profile tree as he views certain
result pages, and then re-ranking the unseen results based
on the enriched part-profile.
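To make the re-ranking step concrete, below is a minimal sketch combining (7) and (8). Tsim and normalize refer to the sketches given earlier, the other names are illustrative, and the result vectors are assumed to have been built from the result summaries with the extractor of Section 3.A.

def categorical_vector(q, profile):
    # V_C = normalize(q + V_Pi), eq. (7)
    v = dict(profile.vector)
    for w, val in q.items():
        v[w] = v.get(w, 0.0) + val
    return normalize(v)

def rerank(results, v_c, alpha=0.6):
    # results: list of (title, vector) pairs in keyword-rank order (i = 1, 2, ...)
    sims = [Tsim(vec, v_c) for _, vec in results]
    # Category rank j: position when sorted by decreasing similarity to V_C
    order = sorted(range(len(results)), key=lambda k: -sims[k])
    cat_rank = {k: j + 1 for j, k in enumerate(order)}
    # Final rank R = (1 - alpha) * i + alpha * j, eq. (8)
    final = {k: (1 - alpha) * (k + 1) + alpha * cat_rank[k]
             for k in range(len(results))}
    return [results[k][0] for k in sorted(final, key=final.get)]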
IV. EXPERIMENTS AND RESULTS
Fig. I demonstrates the effectiveness of the single-document keyword extraction algorithm presented in Section 3.A. The screenshot shows our custom Python library extracting terms from a file on the special theory of relativity. The figures beside the terms denote their weightage in the normalized bag-of-words representation of the shown text. The top stemmed unigrams extracted from the file, along with their weightages, are shown in Table I.
TABLE I. Top stemmed unigrams and their weightages

rel          0.17325919546100588
frame        0.05874220897407544
special      0.05093702536421334
refer        0.04865403837793408
speed        0.044998486481664574
light        0.040486875067201014
time         0.027813976395968706
principl     0.025761707866710587
measur       0.020836409997224744
energi       0.01970431573768541
theori       0.017708931245232745
system       0.015751421261680953
physic       0.014930628176243842
transform    0.014494312458276589
event        0.0140440808920495
Similarly, the top bigrams, with their corresponding weightages, are shown in Table II.

TABLE II. Top bigrams and their weightages

special relativity            0.024982184330494275
reference frame               0.012730133990057175
lorentz transformation        0.003203455449059297
general relativity            0.001782259098970622
michelson–morley experiment   0.001440681252113027
reference frames              0.001371058874856529
special theory                0.001263281003791401
relativity theory             0.001186108126858894
newtonian mechanics           0.001127888028081172
electromagnetic waves         0.000967613086056779
slow velocities               0.000710136766207904
inertial system               0.000658202848743228
= mc                          0.000568514485443002
lorentz transformations       0.000508609548838610
physical laws                 0.000504039434286712
The demo screenshot also shows us loading a DMOZ tree from a Python-serialized format, and using it to classify the aforementioned file. The category path of the file comes out to be 'Top/Science/Physics/Relativity', which is accurate, as expected. There are some irregularities in categorization, due to metadata terms occurring in the vector representations; these stem from incorrect processing of HTML and pre-formatted text, which we shall fix. Some more examples of categorization are shown in Table III.
TABLE III. Examples of categorization

Actual Topic              Predicted Category as per DMOZ
Machine Learning          Top/Computers/Artificial_Intelligence/Machine_Learning
Genetic mutation          Top/Science/Biology/Genetics
Hydrocarbon Fuels         Top/Science/Environment/Environmental_Monitoring
Electrical Engineering    Top/Computers/E-Books
Supervised Learning       Top/Computers/Artificial_Intelligence/Machine_Learning
Combinatorics             Top/Science/Math/Combinatorics
Spectroscopy              Top/Science/Technology/Lighting
Support Vector Machines   Top/Computers/Artificial_Intelligence/Support_Vector_Machines
To demonstrate the process of result re-ranking, we generated a user part-profile tree based on the following Wikipedia articles: (i) Genetics, (ii) Biotechnology, (iii) Gene sequencing, and (iv) The Human Genome. Then, based on the corresponding generated tree, search optimization was performed for the Google search query 'biotechnology dna' with respect to the first ten results. The re-ranking was done using (8), taking α = 0.6. The per-result re-ranking numbers are shown below.
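As a worked instance of (8): the first result below has keyword rank i = 1 and contextual rank j = 9, giving R = 0.4 · 1 + 0.6 · 9 = 5.8, which matches its final rank.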
1) Biotechnology and DNA Structure, Replication, and Technology
Keywords: access, cloudflar, &bull, shmoop, www, access denied, deni, ban, signature accebccbua, signatur
Keyword Rank = 1, Contextual Rank = 9, Final Rank = 5.8

2) Biotechnology - DNA Fingerprinting - SlideShare
Keywords: view, dna, fingerprint, dna fingerprinting, biotechnolog, present, biotechnology dna, fingerprinting views, system
Keyword Rank = 2, Contextual Rank = 2, Final Rank = 2.0

3) Recombinant DNA and Biotechnology - CliffsNotes
Keywords: dna, gene, protein, cell, diseas, recombin, recombinant dna, plant, bacteria, system
Keyword Rank = 3, Contextual Rank = 1, Final Rank = 1.8

4) DNA Sequencing Facility - University of Wisconsin Biotechnology
Keywords: sequenc, dna, center, inform, uwbc, facil, research, contact, dna sequencing, servic
Keyword Rank = 4, Contextual Rank = 7, Final Rank = 5.8

5) DNA profiling - interactive - Biotechnology Online
Keywords: profil, dna, interact, profiling interactive, dna profiling
Keyword Rank = 5, Contextual Rank = 6, Final Rank = 5.6

6) 17: Recombinant DNA and Biotechnology
Keywords: None
Keyword Rank = 6, Contextual Rank = 10, Final Rank = 8.4

7) DNA Biotechnology Facility - Biotechnology - Bioservices Center
Keywords: facil, dna, servic, sequenc, analysi, center, instrument, program, biotechnolog, dna sequencing
Keyword Rank = 7, Contextual Rank = 4, Final Rank = 5.2

8) DNA Sequencing and Genomics Laboratory (Institute of ...)
Keywords: sequenc, genom, dna, laboratori, dna sequencing, helsinki, univers, project, tel +
Keyword Rank = 8, Contextual Rank = 8, Final Rank = 8.0

9) DNA Sequencing Technology: Nature Biotechnology
Keywords: sequenc, dna, dna sequencing, technolog, full, pdf kb, text, pdf, full text, &
Keyword Rank = 9, Contextual Rank = 5, Final Rank = 6.6

10) Bio-6--DNA Fingerprinting in Human Health and Society
Keywords: dna, fingerprint, dna fingerprints, extens, cooper, disord, dna fingerprint, servic, univers, inform
Keyword Rank = 10, Contextual Rank = 3, Final Rank = 5.8
The final re-ranked results (from best to worst) are:
Recombinant DNA and Biotechnology - CliffsNotes
Biotechnology - DNA Fingerprinting - SlideShare
DNA Biotechnology Facility - Biotechnology -
Bioservices Center
DNA Profiling - interactive - Biotechnology Online
Bio-6--DNA Fingerprinting in Human Health and
Society
Biotechnology and DNA Structure, Replication, and
Technology
DNA Sequencing Facility - University of Wisconsin
Biotechnology...
DNA Sequencing Technology: Nature Biotechnology
DNA Sequencing and Genomics Laboratory (Institute of
...)
17: Recombinant DNA and Biotechnology
Understandably, the articles with the most relevant
data are ranked higher. For example, the articles
pertaining to specific university departments are pushed
lower down, while pages with relevant theory get a better
rank after considering the contextual factor.
V. CONCLUSIONS AND FUTURE WORK
In this paper, we presented an efficient and sound
method for implicit client-side user profiling. The usage
of the DMOZ ontology for creating 'part-profiles' ensures
effective contextualization of the user's interests, and
enriches the vocabulary of the profiles sufficiently. Using
a human-edited wealth of information also ensures that
we don't face the problem of a 'cold-start' creating an
impoverished initial part-profile.
Our method builds upon previous work in this domain
by optimizing the categorization algorithm using the
unique nature of the ODP ontology. This ensures speed
and accuracy in categorization of a user's active interests
in a search session. The single-document keyword
extraction algorithm presented in Section 3.A fits
perfectly into the infrastructure, ensuring that every web
page visited by the user contributes to his model, in the
short-term and long-term sense. We forgo the need to
define logical sessions of the user's work, which is the
basis of a lot of related work in literature, focusing
instead on making every action taken by him contribute
to the relevant part-profile.
One possible optimization to our method would be to
have a global repository for vector representations of
popular web-pages, so that the keyword extraction
algorithm doesn't have to be run again and again on
every client's machine for the same version of a web
page. The bag-of-words vector representations of the
user's part-profiles can also be used for other web-search
personalization tasks such as search query auto-
completion, automated spell-checking, etc.
Currently, active work is being done to ensure better
processing of formatted web data so that only relevant
text is processed by the keyword extraction algorithm.
We also plan to integrate our work into a web-browser as
a plug-in, to enable seamless integration of our
techniques into the search process. To provide greater
power to the user, we plan to give him the option to view his individual part-profiles, and the ability to 'switch on' (enable) or 'switch off' (disable) any of them, helping customize the search experience even more [2].
One possible shortcoming of our approach is the language barrier, since the ODP is an English-only ontology. This can, however, be remedied using
Natural Language Processing techniques to translate the
individual document vectors into those of English words,
so that they can be mapped to the relevant DMOZ
categories. In this sense too, the development of a global
collection of web-page vectors would be beneficial.
Thus, our work has many possible applications in the
field of web search personalization, apart from result re-
ranking, and its flexibility ensures that it can fit the needs
of every domain of study very well.
REFERENCES
[1] The Open Directory Project (ODP) (2006). Available online: http://dmoz.org.
[2] Pitkow, J. et al. (2002). “Personalized Search,” Communications of the ACM, 45(9), pp. 50-55.
[3] Shen, X., Tan, B., and Zhai, C. X. (2005). “Implicit User Modeling for Personalized Search,” in Proceedings of the 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, pp. 824-831.
[4] Tan, A. H. and Teo, C. (1998). “Learning User Profiles for Personalized Information Dissemination,” in Proceedings of the International Joint Conference on Neural Networks, Anchorage, AK, pp. 183-188.
[5] Gauch, S., Chaffee, J., and Pretschner, A. (2003). “Ontology-Based Personalized Search and Browsing,” Web Intelligence and Agent Systems, 1(3-4), pp. 219-234.
[6] Gruber, T. R. (1995). “Toward Principles for the Design of Ontologies Used for Knowledge Sharing,” International Journal of Human-Computer Studies, 43(5-6), pp. 907-928.
[7] Porter, M. F. (1980). “An Algorithm for Suffix Stripping,” Program, 14(3), pp. 130-137.
[8] Widyantoro, D. H., Ioerger, T., and Yen, J. (2000). “Learning User Interest Dynamics with a Three-Descriptor Representation,” Journal of the American Society for Information Science, 52(3), pp. 212-225.
[9] Singh, S., Vajirkar, P., and Lee, Y. (2003). “Context-Aware Data Mining Using Ontologies,” in Proceedings of the 14th International Conference on Database and Expert Systems Applications (DEXA).
[10] Kageura, K. and Umino, B. (1996). “Methods of Automatic Term Recognition: A Review,” Terminology, 3(2): 259.
[11] Challam, V., Gauch, S., and Chandramouli, A. (2007). “Contextual Search Using Ontology-Based User Profiles,” in Proceedings of RIAO 2007.
[12] Matsuo, Y. and Ishizuka, M. (2004). “Keyword Extraction from a Single Document Using Word Co-occurrence Statistical Information,” International Journal on Artificial Intelligence Tools, 13(1), pp. 157-169.
[13] Sellen, A. J., Murphy, R., and Shaw, K. L. (2002). “How Knowledge Workers Use the Web,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Changing our World, Changing Ourselves, Minneapolis, MN, pp. 227-234.
[14] Lawrence, S. (2000). “Context in Web Search,” IEEE Data Engineering Bulletin, 23(3), pp. 25-32.
[15] Kelly, D. and Teevan, J. (2003). “Implicit Feedback for Inferring User Preference: A Bibliography,” SIGIR Forum, 37(2), pp. 18-28.
[16] Genesereth, M. R. and Nilsson, N. J. (1987). Logical Foundations of Artificial Intelligence. San Mateo, CA: Morgan Kaufmann Publishers.
[17] Ma, Z., Pant, G., and Liu Sheng, O. R. (2007). “Interest-Based Personalized Search,” ACM Transactions on Information Systems, 25(1).
[18] Jansen, B. J. and Spink, A. (2005). “An Analysis of Web Searching by European AlltheWeb.com Users,” Information Processing and Management, 41, pp. 361-381.
[19] Ma, Z., Liu Sheng, O. R., and Pant, G. (2005). “Evaluation of Ontology-Based User Interests Modeling,” in Proceedings of the 4th Workshop on e-Business.
[20] Liu, F., Yu, C., and Meng, W. (2004). “Personalized Web Search for Improving Retrieval Effectiveness,” IEEE Transactions on Knowledge and Data Engineering, 16(1), pp. 28-40.
[21] Fürnkranz, J. (1998). “A Study Using n-gram Features for Text Categorization,” Technical Report OEFAI-TR-98-30, Austrian Research Institute for Artificial Intelligence.
[22] Tan, P.-N., Steinbach, M., and Kumar, V. (2005). Introduction to Data Mining. Addison-Wesley. ISBN 0-321-32136-7.