Tree-Based Ontological User Profiling for Web Search Optimization

Sachin R. Joglekar¹ and Dr. Mangesh Bedekar²
¹ BITS-Pilani, K. K. Birla Goa Campus, Goa, India
Email: [email protected]
² Maharashtra Institute of Technology, Kothrud, Pune, India
Email: [email protected]
Abstract--Today, the internet represents the most important source of knowledge for the average user. As the number of users turning to search engines for information increases, web search personalization is becoming an important domain of research in information retrieval. It is essential to adapt the process of web search to the needs of every individual, without explicit actions on the user's side. In this paper, we demonstrate an implicit client-side method for user profiling using an ontological tree, for efficient contextualization of a user's interests. Our system builds 'part-profiles' of a user over time from his web usage data, each pertaining to one domain of interest of the said user. We also demonstrate the usage of these profiles for search result re-ranking by means of a tree-traversing algorithm, to ensure faster knowledge gain from the user's perspective. The methods described in this paper can also be applied to other forms of web search optimization, such as query expansion.
Index Terms--User Profiling, Web Search, Ontology
I. INTRODUCTION
A knowledge worker can be defined as someone
“whose paid work involves significant time spent in
gathering, finding, analyzing, creating, producing or
archiving information” [13]. Essentially, every web user today is a knowledge worker, using search engines extensively to obtain required knowledge from the web. Considering the ever-increasing
amount of text data being added to the internet, it is
important to make it easier for the user to get to the
information he wants, as efficiently as possible. With this
aim in mind, the one-size-fits-all approach [14] is no
longer recommended, especially considering the
polysemous nature of words in the English vocabulary.
The need to model the web experience of an individual as
per his specific interests is more important than ever.
The best possible way to understand what a user needs
from his web experience is relevance feedback for web
pages from the user himself [4]. However, a user is usually reluctant to make the extra effort needed to provide such information explicitly on a regular basis
[15]. Hence, implicit user modeling is required, to avoid
the extra hassle from the user's side. Moreover, fast
processing of a user's interests will only be possible if it
is achieved at the individual client-side, helping
distribute the necessary computations instead of
burdening the server-side infrastructure.
To achieve the aim of implicit client-side search
optimization, we take the help of the DMOZ
(directory.mozilla.org) ontology. An ontology is defined as “an explicit specification of a conceptualization: the
objects, concepts, and other entities that are assumed to
exist in some area of interest and the relationships that
hold among them” [6, 16]. With the idea of accurately
representing a user's information needs and search
interests, we propose a system to implicitly build an
ontological tree of 'part-profiles'. The ontology used for
this purpose is the DMOZ/ODP (Open Directory Project)
ontology [1]. Every part-profile, as constructed by this
method, would denote an area of the user's interest and
help in optimizing web search pertaining to it.
The aforementioned tree of ontological part-profiles
will be updated dynamically, as the user visits various
web pages. Such dynamic generation of user profiles has
been explored in the UCAIR system [3]. However,
UCAIR requires the definition of 'logical sessions' of a
user's web activities, where every distinct session would
pertain to a unique profile. Our method overcomes this
shortcoming by extracting important keywords from
every web page viewed by a user, and exploiting the
ODP hierarchy to understand the field of study of a web
search [12]. The keywords extracted from a certain web
page form a weighted vector in the bag-of-words
representation of the page, which is then categorized into
one of the user's profiles. Ref. [17] explored a similar
method for ontological categorization of visited pages.
We improve upon their method by using a tree-based
algorithm over the profile tree for fast categorization.
This avoids the need to compare a certain page vector
with every category in the tree. When the similarity of a
page vector to its predicted category falls below a pre-
defined threshold, a new part-profile is initialized for the
user.
Personalization of web search is attained by
constructing an expanded version of the user's original
query vector using the summaries of the top search
results from a search engine. This vector is then either
classified to a part profile of the user (if the similarity is
above the threshold) or added to a newly generated
profile based on the DMOZ ontology. Then, result re-
ranking is done on the top search results [18] by
computing their final rank based on an algorithm inspired
by [11].
The remaining sections are organized as follows. In Section 2, we discuss related work. In Section 3, we present our complete methodology for dynamic user profiling using the ODP ontology tree. In Section 4, we demonstrate the experimental results obtained by implementing the framework described in Section 3. Section 5 presents our conclusions, a discussion of the shortcomings of our approach, and directions for future work.
II. RELATED WORK
Ref. [2] very appropriately defines the two pillars of personalized search: contextualization and individualization. Contextualization refers to the definition and representation of the background information regarding a user's work and the nature of his search interests. Individualization refers to the distinguishing factors and data pertaining to a user's own unique information needs. Usage of an ontology for user profiling satisfies both these requirements: the extensive vocabulary of DMOZ aids adequate representation of a user's background (contextualization), while the hierarchy of domains in the category tree helps in accurate descriptions of the topics of interest (individualization). Ref. [9] recognized the four main types of context in web applications: domain, location, data, and user. With the help of our system, we exploit
data context (mining the ODP wealth of knowledge for
content) and user context (focusing on the web pages
viewed by the user). Ref. [2] uses the ODP ontology to
classify every page clicked by the user into one of its
categories, leading to the construction of individual-
domain profiles, similar to our explained work. However,
their method differs from ours in the algorithms followed
for keyword extraction and categorization.
Ref. [3] proposed the development of UCAIR - a
decision-theoretic framework for implicit user modeling
based on query expansion and click-through information.
As search engine queries are usually short, the user
models based on them are understandably impoverished.
Hence, query expansion is utilized to enrich the notion of
what the user desires. The need to perform the user
modeling at the client side is also stressed, to reduce the
server load drastically [3]. However, as mentioned
before, to exploit previous queries and the corresponding
click-through data, UCAIR needs to judge whether two
adjacent queries belong to the same logical session. Our
framework overcomes this hurdle by making use of the
DMOZ ontology to define the 'topic' of a search session,
and extract keywords from every document to
automatically group together information from various
visited web pages. Moreover, UCAIR focuses on user
modeling based on short-term contexts, while our
framework remembers the user's interests over a long
period of time.
Ref. [19] evaluates the various ontology-based
methods to model user interests – primarily by changing
the type of inputs provided for mapping interests onto the
ODP ontology. The two best inputs to consider while
understanding the user's information needs are (i) the text
content of the web pages dwelled upon and frequently
visited by the user, and (ii) the queries input by the user,
in an expanded format [19]. We utilize the page content
data for building the part profiles and the expanded-query
approach to aid search result re-ranking. Ref. [20] suggested building a user profile comprising data from previous search query terms. However, since
the interpretation of a query by a search engine may be
erroneous, we focus on web pages viewed by a user,
weighted by the amount of attention given to each of
them.
A reasonably good method to expand a search query
is to derive additional terms from the summaries of the
top 10-50 search results, depending on the available
resources as explained in [3, 19]. Query expansion not
only resolves the problem of poor vocabulary of the
original query, but also avoids the issues occurring due to
word mismatch [19].
For categorizing a vector using the DMOZ category
tree, the OBIWAN system maps every visited web page
to five different categories [5]. Categorization is done by
comparing a test vector with the vector corresponding to
every single category [17]. Our system improves upon
this by using a tree-traversing algorithm to reduce the
number of comparisons and make the system more
efficient. To improve accuracy, we classify every page to
only one unique category. Cosine similarity is generally
used while mapping a page or query vector onto the
appropriate ODP category [3, 11, 17, 19]. In this work, we use a modified version of the Tanimoto coefficient, also
known as the extended Jaccard coefficient [22], as an
indicator of vector-vector similarity.
The method usually used for re-ranking search results is sorting in descending order of cosine similarity. UCAIR utilizes expanded queries and viewed
document summaries to dynamically re-rank unseen
search results [3]. Ref. [11] proposed to compute final
rank of every search result as a linear combination of the
contextual rank (computed using cosine similarity with
context vector) and the keyword rank (original rank
provided by the search engine). This allows flexibility in assigning weightage to both types of ranks to get a 'hybrid' rank, focusing on keyword as well as contextual
similarity.
III. THE PROFILING MODEL
A. Single Document Keyword Extraction
We focus on the extraction of keywords from every
web page/document read by the user instead of grouping
together web pages into sessions. To achieve this aim, we
use a modified version of the single-document keyword
extraction algorithm proposed in [12]. Ref. [10]
summarizes the five groups of keyword weighing
methods - (i) a word which appears in a document is
likely to be an index term, (ii) a word which appears
frequently in a document is likely to be an index term,
(iii) a word which appears only in a limited number of
documents is likely to be an index term for these
documents, (iv) a word which appears relatively more
frequently in a document than in the whole database is
likely to be an index term for that document, and (v) a
word which shows a specific distributional characteristic
in the database is likely to be an index term for the
database. Ref. [12] focuses on keywords which show a
specific distributional characteristic in the database. This
is done by weighing them according to the degree of bias
of their co-occurrences, with the frequently appearing
keywords in the document. The co-occurrence bias can
be measured quantitatively by calculating the statistical
value of χ²(w) as an index of bias [12]. The said value is given by the equation

χ²(w) = Σ_{g ∈ G} (freq(w, g) − n_w p_g)² / (n_w p_g) ,    (1)

where w is a certain keyword, g is a keyword from G, the set of most commonly occurring keywords, freq(w, g) denotes the number of co-occurrences of keywords w and g, n_w denotes the number of keywords present in the sentences in which w occurs, and p_g denotes the percentage of keywords that occur in sentences in which
g is found. We construct G by considering the top one
third of most frequently occurring keywords in a given
page. Ref. [12] suggested using only G for co-occurrence
measures, since only the frequently occurring words in a
document strongly determine the occurrence
distributions.
In (1), n_w p_g denotes the 'expected' co-occurrence count of w and g, while freq(w, g) is the actual value of this quantity. Hence, larger values of χ²(w) indicate a stronger bias in the co-occurrence of w with the keywords in G. According to
experimental results, this method proves comparable to
the popular tf-idf (term frequency-inverse document
frequency) scale for domain-independent keyword
extraction. Since the domain of knowledge a web page
belongs to will not be known beforehand, this weighing
scale is very useful in our methodology.
However, just a strong co-occurrence bias may not be
sufficient to measure the importance of a keyword in a
document. For example, words that occur very rarely in a
document may also end up having very biased co-
occurrence distributions, but may not be important in
defining its context. Therefore, we have modified the
importance index by considering the weighing methods
defined in [10], namely term frequencies. Hence, the
quantity we propose for measuring the importance of keywords in a single document is

I(w) = tf(w) · χ²(w) ,    (2)

where tf(w) denotes the term frequency of word w in the given document, and χ²(w) is given by (1).
This ensures that keywords which not only occur frequently but also show a specific distributional characteristic in the document are given the most importance. The keywords considered for this algorithm were extracted from a text document in the form of unigrams and bigrams using the Apriori algorithm [21], after preliminary pre-processing such as stop word removal followed by stemming [7]. Experimental
evidence shows that this method is very effective in
extracting keywords from single web pages, without
requiring a corpus.
We thus construct a bag-of-words representation of
any document with the weightages given to every
keyword being equal to I(w) (2).
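To make this concrete, below is a minimal sketch of the weighing scheme of (1) and (2). The function name is illustrative, tokenization is simplified to lowercase word splitting (rather than the full Apriori-based unigram/bigram extraction with stop word removal and stemming described above), and co-occurrence counts are approximated at sentence granularity.

import re
from collections import Counter

def keyword_importance(text):
    # Split into sentences, then into lowercase word tokens
    sentences = [re.findall(r"[a-z]+", s.lower())
                 for s in re.split(r"[.!?]+", text) if s.strip()]
    tf = Counter(w for sent in sentences for w in sent)
    # G: top one third of the most frequently occurring terms
    terms = [w for w, _ in tf.most_common()]
    G = set(terms[:max(1, len(terms) // 3)])
    total = sum(tf.values())
    # p_g: fraction of all terms that occur in sentences containing g
    p = {g: sum(len(s) for s in sentences if g in s) / total for g in G}
    importance = {}
    for w in tf:
        # n_w: number of terms in the sentences in which w occurs
        n_w = sum(len(s) for s in sentences if w in s)
        chi2 = 0.0
        for g in G:
            if g == w:
                continue
            freq_wg = sum(1 for s in sentences if w in s and g in s)
            expected = n_w * p[g]
            if expected > 0:
                chi2 += (freq_wg - expected) ** 2 / expected  # eq. (1)
        importance[w] = tf[w] * chi2   # I(w) = tf(w) * chi^2(w), eq. (2)
    # Normalized bag-of-words representation of the document
    norm = sum(importance.values()) or 1.0
    return {w: v / norm for w, v in importance.items()}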
B. Construction of the ODP Tree
The ODP ontology, also known as DMOZ, is the largest human-edited directory of Internet sites [1]. We
build a hierarchical tree of vectors from this ontology,
with each node denoting one category, and its children
depicting its sub-categories. Ref. [5] showed that while
generating representative vectors for DMOZ categories,
5-60 documents-worth of data provided quite accurate
results. The method we follow is similar to the one used
by [17] while generating individual category profiles, but
with an added recursive step in the end.
We first construct a tree data structure based on the
DMOZ ontology. Then, for every category, we append
all its URL descriptions together (not considering its sub-
categories). This piece of text can now be considered as a
'document' describing that particular topic. We then
apply the single-document keyword extraction algorithm
described in Section 3.A on the category document to get
a bag-of-words vector representation for it. Before
storing it into the corresponding tree node, the vector is
normalized to avoid bias towards categories with larger
number of associated URL descriptions.
After this initialization, we run a recursive algorithm to update every node's vocabulary based on that of its children. This method can be expressed as

V_N,final = normalize(V_N,initial + Σ_{n ∈ C(N)} ω(n) · V_n,final) ,    (3)

where N denotes a particular node, V_N,initial denotes the initial vector of node N constructed from URL descriptions, V_N,final denotes the final vector of N, C(N) denotes the set of its children, and ω(n) is the weightage factor for the vector of child n. For our experiments, we take ω(n) = 1.
Thus, we construct the DMOZ ontology tree used for
user profiling, with each category-node being assigned an
appropriate bag-of-words representation in the vector
space model.
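As an illustration, the following sketch builds the node vectors of (3) bottom-up over a simple tree of sparse dict-based vectors. The class and helper names are ours, not part of the ODP data, and we take ω(n) = 1 as in our experiments; the same normalize helper is reused in the sketches that follow.

class CategoryNode:
    # One DMOZ category; vector is a sparse bag-of-words dict
    def __init__(self, name, vector=None):
        self.name = name
        self.vector = dict(vector or {})   # V_N,initial from URL descriptions
        self.children = []

def normalize(vector):
    # L2 normalization (illustrative; any consistent norm works)
    norm = sum(v * v for v in vector.values()) ** 0.5
    return {w: v / norm for w, v in vector.items()} if norm else vector

def enrich(node, omega=lambda child: 1.0):
    # Bottom-up pass implementing (3): add the weighted final vectors
    # of the children to the node's initial vector, then re-normalize
    for child in node.children:
        enrich(child, omega)
        w = omega(child)
        for term, value in child.vector.items():
            node.vector[term] = node.vector.get(term, 0.0) + w * value
    node.vector = normalize(node.vector)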
C. Vector Categorization
We now describe the tree-traversing vector
categorization method used in our framework. Consider a
normalized document vector Vtest constructed using the
method as in Section 3.A. To classify it to a category of
the DMOZ ontology, we use the following algorithm, given here in the form of Python code:

# v_test is the normalized document vector V_test; Top is the root node
current_node = Top
current_similarity = Tsim(current_node.vector, v_test)
while True:
    # Stop when a leaf category is reached
    if len(current_node.children) == 0:
        break
    # Similarity of the test vector to every child category
    similarities = {}
    for child in current_node.children:
        similarities[child] = Tsim(child.vector, v_test)
    max_child = max(similarities, key=similarities.get)
    max_similarity = similarities[max_child]
    # Descend only while the similarity keeps increasing
    if max_similarity > current_similarity:
        current_node = max_child
        current_similarity = max_similarity
    else:
        break
The above algorithm, at every node, keeps traversing
to the child with the highest similarity to the test vector,
provided the similarity is greater than that with the
current node. This ensures that the tree traversal stops at
the category node with the highest similarity to the test
vector. The measure of similarity used in this algorithm
is a modified version of the extended Jaccard coefficient,
also known as the Tanimoto coefficient, given by
Tsim(V_N, V_test) = (φ(V_N, V_test) · V_test) / (|φ(V_N, V_test)|² + |V_test|² − φ(V_N, V_test) · V_test) ,    (4)

where φ(V_N, V_test) is the component of V_N restricted to the keywords that also occur in V_test:

φ(V_N, V_test) = {w | w ∈ V_N and w ∈ V_test} .    (5)
This form of the Tanimoto coefficient proves better suited for vector categorization. The basic idea is to consider only that component of the node's vector that lies along the dimensions (keywords) the test vector comprises. Since a category vector may potentially contain concepts from various other sub-domains apart from the user's interest, considering all of them in the calculation is futile.
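A minimal sketch of this similarity measure over sparse dict-based vectors follows; the function and variable names are illustrative, and the vectors map keywords to weightages as produced in Section 3.A.

def Tsim(v_node, v_test):
    # phi(V_N, V_test): component of the node vector restricted to the
    # keywords present in the test vector, eq. (5)
    phi = {w: v_node[w] for w in v_test if w in v_node}
    dot = sum(phi[w] * v_test[w] for w in phi)
    phi_sq = sum(v * v for v in phi.values())
    test_sq = sum(v * v for v in v_test.values())
    # Modified Tanimoto (extended Jaccard) coefficient, eq. (4)
    denom = phi_sq + test_sq - dot
    return dot / denom if denom else 0.0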
D. Building of Part-Profiles
Consider a user with part-profile tree P, who views a
document with normalized vector d (extracted using
method of Section 3.A). Suppose it gets classified to the
part-profile Pi, based on the algorithm discussed in
Section 3.C. It is to be noted that the classification is
done using the part-profile tree, not the original ODP
ontology tree.
If Tsim(d, V_Pi) > Ɛ, where Ɛ is a pre-defined threshold value, then the part-profile Pi is enriched using the formula

V_Pi,new = normalize(Ω(d, n_i) · d + (1 − Ω(d, n_i)) · V_Pi) ,    (6)

where n_i is the total number of web pages visited by the user pertaining to part-profile Pi, and Ω(d, n_i) is an 'importance' function of n_i and of the total time the user spends on the document. In our framework, we take Ω(d, n_i) = 1/n_i, so that recently viewed documents affect the category profile strongly enough to provide short-term contextualization. Finally, the value of n_i is incremented by 1 to record the visit to one more web page.
If Tsim(d, V_Pi) < Ɛ, a new part-profile is initialized for the user, using the original DMOZ ontology tree and the categorization algorithm presented in Section 3.C. The category vector of the new part-profile is again initialized using (6) (with n_j = 1, where j is the computed category). We empirically find the optimum value of the threshold Ɛ by considering the similarities of DMOZ categories with irrelevant test vectors.
Thus, we construct a part-profile tree for every user based on the web documents he reads or views. Initially,
the profile tree is empty. This tree grows and is enriched
based on the concepts present in the DMOZ ontology and
the data extracted from the web pages visited by the user.
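As an illustration, the following sketch implements the enrichment rule (6) with Ω(d, n_i) = 1/n_i. The PartProfile class and function names are illustrative, normalize is the helper from the Section 3.B sketch, and the decision between enriching an existing part-profile and initializing a new one (the Ɛ test above) is assumed to happen before this function is called.

class PartProfile:
    # A part-profile: a sparse category vector plus a visit count n_i
    def __init__(self, vector):
        self.vector = dict(vector)  # seeded from the matched ODP category
        self.n = 1                  # n_i = 1 on initialization

def update_part_profile(profile, d):
    # Enrich the profile with document vector d, per (6):
    # V_new = normalize(Omega * d + (1 - Omega) * V_old), Omega = 1/n_i
    omega = 1.0 / profile.n
    merged = {w: (1.0 - omega) * v for w, v in profile.vector.items()}
    for w, v in d.items():
        merged[w] = merged.get(w, 0.0) + omega * v
    profile.vector = normalize(merged)
    profile.n += 1                  # one more page visited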
E. Search Result Re-ranking
For search result re-ranking, we first employ query
augmentation on the original query to enrich it. This is
done based on the keyword-similar search results
provided by the search engine. We append the summaries
of the top 10 results yielded by the search engine into a
document, and then apply our single-document keyword
extraction algorithm on it to form an expanded query
vector q.
This query vector is categorized into the user's part-profile tree using the tree-traversing algorithm of Section 3.C. Let the computed part-profile be Pi. If Tsim(q, V_Pi) > Ɛ, then the categorical vector used for result re-ranking is given as

V_C = normalize(q + V_Pi) .    (7)

If Tsim(q, V_Pi) < Ɛ, a new part-profile is generated using the original ODP tree, as in Section 3.D, and V_C is again calculated using (7).
Let the bag-of-words vector representation of the i-th search result be V_i, where i is the keyword rank of the particular search result, given by the search engine. The 'category ranks' of the search results are calculated by sorting them in decreasing order of Tsim(V_i, V_C). Let the category rank of the i-th search result be j. We calculate the resultant rank of this page by computing a linear combination of the keyword rank and the category rank [11]. Thus, the final rank R is given by

R = (1 − α) · i + α · j ,    (8)

where α is the weightage given to the category rank, and 0 < α < 1. We take α to be 0.6, since category ranks usually prove more relevant than keyword ranks in result re-ranking [11].
The process of result re-ranking can be made dynamic
by updating the user part-profile tree as he views certain
result pages, and then re-ranking the unseen results based
on the enriched part-profile.
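To make the re-ranking step concrete, below is a minimal sketch combining (7) and (8). Tsim and normalize refer to the sketches given earlier, the other names are illustrative, and the result vectors are assumed to have been built from the result summaries with the extractor of Section 3.A.

def categorical_vector(q, profile):
    # V_C = normalize(q + V_Pi), eq. (7)
    v = dict(profile.vector)
    for w, val in q.items():
        v[w] = v.get(w, 0.0) + val
    return normalize(v)

def rerank(results, v_c, alpha=0.6):
    # results: list of (title, vector) pairs in keyword-rank order (i = 1, 2, ...)
    sims = [Tsim(vec, v_c) for _, vec in results]
    # Category rank j: position when sorted by decreasing similarity to V_C
    order = sorted(range(len(results)), key=lambda k: -sims[k])
    cat_rank = {k: j + 1 for j, k in enumerate(order)}
    # Final rank R = (1 - alpha) * i + alpha * j, eq. (8)
    final = {k: (1 - alpha) * (k + 1) + alpha * cat_rank[k]
             for k in range(len(results))}
    return [results[k][0] for k in sorted(final, key=final.get)]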
IV. EXPERIMENTS AND RESULTS
Fig. I demonstrates the effectiveness of the single-document keyword extraction algorithm presented in Section 3.A. The screenshot shows our custom Python library extracting terms from a file on the special theory of relativity. The figures beside the terms denote their weightage in the normalized bag-of-words representation of the shown text. The top stemmed unigrams extracted from the file, along with their weightages, are shown in Table I.
TABLE I. Top stemmed unigrams and their weightages

rel          0.17325919546100588
frame        0.05874220897407544
special      0.05093702536421334
refer        0.04865403837793408
speed        0.044998486481664574
light        0.040486875067201014
time         0.027813976395968706
principl     0.025761707866710587
measur       0.020836409997224744
energi       0.01970431573768541
theori       0.017708931245232745
system       0.015751421261680953
physic       0.014930628176243842
transform    0.014494312458276589
event        0.0140440808920495
Similarly, the top bigrams, with their corresponding weightages, are shown in Table II.

TABLE II. Top bigrams and their weightages

special relativity            0.024982184330494275
reference frame               0.012730133990057175
lorentz transformation        0.003203455449059297
general relativity            0.001782259098970622
michelson–morley experiment   0.001440681252113027
reference frames              0.001371058874856529
special theory                0.001263281003791401
relativity theory             0.001186108126858894
newtonian mechanics           0.001127888028081172
electromagnetic waves         0.000967613086056779
slow velocities               0.000710136766207904
inertial system               0.000658202848743228
= mc                          0.000568514485443002
lorentz transformations       0.000508609548838610
physical laws                 0.000504039434286712
The demo screenshot also shows us loading a DMOZ tree from a Python-serialized format, and using it to classify the aforementioned file. The category path of the file comes out to be 'Top/Science/Physics/Relativity', which is accurate, as expected. There are some irregularities in categorization, due to metadata terms occurring in the vector representations; these stem from incorrect processing of HTML and pre-formatted text, which we shall fix. Some more examples of categorization are shown in Table III.
TABLE III. Examples of categorization

Actual Topic              Predicted Category as per DMOZ
Machine Learning          Top/Computers/Artificial_Intelligence/Machine_Learning
Genetic mutation          Top/Science/Biology/Genetics
Hydrocarbon Fuels         Top/Science/Environment/Environmental_Monitoring
Electrical Engineering    Top/Computers/E-Books
Supervised Learning       Top/Computers/Artificial_Intelligence/Machine_Learning
Combinatorics             Top/Science/Math/Combinatorics
Spectroscopy              Top/Science/Technology/Lighting
Support Vector Machines   Top/Computers/Artificial_Intelligence/Support_Vector_Machines
To demonstrate the process of result re-ranking, we generated a user part-profile tree based on the following Wikipedia articles: (i) Genetics, (ii) Biotechnology, (iii) Gene sequencing, and (iv) The Human Genome. Then, based on the corresponding generated tree, search optimization was performed for the Google search query 'biotechnology dna' with respect to the first ten results. The re-ranking was done using (8), taking α = 0.6. The per-result re-ranking numbers are shown below.
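As a worked instance of (8): the first result below has keyword rank i = 1 and contextual rank j = 9, giving R = 0.4 · 1 + 0.6 · 9 = 5.8, which matches its final rank.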
1) Biotechnology and DNA Structure, Replication, and Technology
Keywords: access, cloudflar, &bull, shmoop, www, access denied, deni, ban, signature accebccbua, signatur
Keyword Rank = 1, Contextual Rank = 9, Final Rank = 5.8

2) Biotechnology - DNA Fingerprinting - SlideShare
Keywords: view, dna, fingerprint, dna fingerprinting, biotechnolog, present, biotechnology dna, fingerprinting views, system
Keyword Rank = 2, Contextual Rank = 2, Final Rank = 2.0

3) Recombinant DNA and Biotechnology - CliffsNotes
Keywords: dna, gene, protein, cell, diseas, recombin, recombinant dna, plant, bacteria, system
Keyword Rank = 3, Contextual Rank = 1, Final Rank = 1.8

4) DNA Sequencing Facility - University of Wisconsin Biotechnology
Keywords: sequenc, dna, center, inform, uwbc, facil, research, contact, dna sequencing, servic
Keyword Rank = 4, Contextual Rank = 7, Final Rank = 5.8

5) DNA profiling - interactive - Biotechnology Online
Keywords: profil, dna, interact, profiling interactive, dna profiling
Keyword Rank = 5, Contextual Rank = 6, Final Rank = 5.6

6) 17: Recombinant DNA and Biotechnology
Keywords: None
Keyword Rank = 6, Contextual Rank = 10, Final Rank = 8.4

7) DNA Biotechnology Facility - Biotechnology - Bioservices Center
Keywords: facil, dna, servic, sequenc, analysi, center, instrument, program, biotechnolog, dna sequencing
Keyword Rank = 7, Contextual Rank = 4, Final Rank = 5.2

8) DNA Sequencing and Genomics Laboratory (Institute of ...)
Keywords: sequenc, genom, dna, laboratori, dna sequencing, helsinki, univers, project, tel +
Keyword Rank = 8, Contextual Rank = 8, Final Rank = 8.0

9) DNA Sequencing Technology: Nature Biotechnology
Keywords: sequenc, dna, dna sequencing, technolog, full, pdf kb, text, pdf, full text, &
Keyword Rank = 9, Contextual Rank = 5, Final Rank = 6.6

10) Bio-6--DNA Fingerprinting in Human Health and Society
Keywords: dna, fingerprint, dna fingerprints, extens, cooper, disord, dna fingerprint, servic, univers, inform
Keyword Rank = 10, Contextual Rank = 3, Final Rank = 5.8
The final re-ranked results (from best to worst) are:
Recombinant DNA and Biotechnology - CliffsNotes
Biotechnology - DNA Fingerprinting - SlideShare
DNA Biotechnology Facility - Biotechnology -
Bioservices Center
DNA Profiling - interactive - Biotechnology Online
Bio-6--DNA Fingerprinting in Human Health and
Society
Biotechnology and DNA Structure, Replication, and
Technology
DNA Sequencing Facility - University of Wisconsin
Biotechnology...
DNA Sequencing Technology: Nature Biotechnology
DNA Sequencing and Genomics Laboratory (Institute of
...)
17: Recombinant DNA and Biotechnology
Understandably, the articles with the most relevant
data are ranked higher. For example, the articles
pertaining to specific university departments are pushed
lower down, while pages with relevant theory get a better
rank after considering the contextual factor.
V. CONCLUSIONS AND FUTURE WORK
In this paper, we presented an efficient and sound
method for implicit client-side user profiling. The usage
of the DMOZ ontology for creating 'part-profiles' ensures
effective contextualization of the user's interests, and
enriches the vocabulary of the profiles sufficiently. Using
a human-edited wealth of information also ensures that
we don't face the problem of a 'cold-start' creating an
impoverished initial part-profile.
Our method builds upon previous work in this domain
by optimizing the categorization algorithm using the
unique nature of the ODP ontology. This ensures speed
and accuracy in categorization of a user's active interests
in a search session. The single-document keyword
extraction algorithm presented in Section 3.A fits
perfectly into the infrastructure, ensuring that every web
page visited by the user contributes to his model, in the
short-term and long-term sense. We forgo the need to
define logical sessions of the user's work, which is the
basis of a lot of related work in literature, focusing
instead on making every action taken by him contribute
to the relevant part-profile.
One possible optimization to our method would be to
have a global repository for vector representations of
popular web-pages, so that the keyword extraction
algorithm doesn't have to be run again and again on
every client's machine for the same version of a web
page. The bag-of-words vector representations of the
user's part-profiles can also be used for other web-search
personalization tasks such as search query auto-
completion, automated spell-checking, etc.
Currently, active work is being done to ensure better
processing of formatted web data so that only relevant
text is processed by the keyword extraction algorithm.
We also plan to integrate our work into a web-browser as
a plug-in, to enable seamless integration of our
techniques into the search process. To provide greater
power to the user, we plan to give him the option to view his individual part-profiles, and the ability to 'switch on' (enable) or 'switch off' (disable) any of them, helping customize the search experience even more [2].
One possible shortcoming of our approach is the language barrier, since the ODP is an English-only ontology. This can, however, be remedied using
Natural Language Processing techniques to translate the
individual document vectors into those of English words,
so that they can be mapped to the relevant DMOZ
categories. In this sense too, the development of a global
collection of web-page vectors would be beneficial.
Thus, our work has many possible applications in the
field of web search personalization, apart from result re-
ranking, and its flexibility ensures that it can fit the needs
of every domain of study very well.
REFERENCES
[1] The Open Directory Project (ODP) (2006). Available online: http://dmoz.org.
[2] Pitkow, J. et al. (2002). “Personalized Search,” Communications of the ACM, 45(9), pp. 50-55.
[3] Shen, X., Tan, B., and Zhai, C. X. (2005). “Implicit User Modeling for Personalized Search,” in Proceedings of the 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, pp. 824-831.
[4] Tan, A. H. and Teo, C. (1998). “Learning User Profiles for Personalized Information Dissemination,” in Proceedings of the International Joint Conference on Neural Networks, Anchorage, AK, pp. 183-188.
[5] Gauch, S., Chaffee, J., and Pretschner, A. (2003). “Ontology-Based Personalized Search and Browsing,” Web Intelligence and Agent Systems, 1(3-4), pp. 219-234.
[6] Gruber, T. R. (1995). “Toward Principles for the Design of Ontologies Used for Knowledge Sharing,” International Journal of Human-Computer Studies, 43(5-6), pp. 907-928.
[7] Porter, M. F. (1980). “An Algorithm for Suffix Stripping,” Program, 14(3), pp. 130-137.
[8] Widyantoro, D. H., Ioerger, T., and Yen, J. (2000). “Learning User Interest Dynamics with a Three-Descriptor Representation,” Journal of the American Society for Information Science, 52(3), pp. 212-225.
[9] Singh, S., Vajirkar, P., and Lee, Y. (2003). “Context-Aware Data Mining Using Ontologies,” in Proceedings of the 14th International Conference on Database and Expert Systems Applications (DEXA).
[10] Kageura, K. and Umino, B. (1996). “Methods of Automatic Term Recognition: A Review,” Terminology, 3(2): 259.
[11] Challam, V., Gauch, S., and Chandramouli, A. (2007). “Contextual Search Using Ontology-Based User Profiles,” in Proceedings of RIAO 2007.
[12] Matsuo, Y. and Ishizuka, M. (2004). “Keyword Extraction from a Single Document Using Word Co-occurrence Statistical Information,” International Journal on Artificial Intelligence Tools, 13(1), pp. 157-169.
[13] Sellen, A. J., Murphy, R., and Shaw, K. L. (2002). “How Knowledge Workers Use the Web,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Changing our World, Changing Ourselves, Minneapolis, MN, pp. 227-234.
[14] Lawrence, S. (2000). “Context in Web Search,” IEEE Data Engineering Bulletin, 23(3), pp. 25-32.
[15] Kelly, D. and Teevan, J. (2003). “Implicit Feedback for Inferring User Preference: A Bibliography,” SIGIR Forum, 37(2), pp. 18-28.
[16] Genesereth, M. R. and Nilsson, N. J. (1987). Logical Foundations of Artificial Intelligence. San Mateo, CA: Morgan Kaufmann Publishers.
[17] Ma, Z., Pant, G., and Liu Sheng, O. R. (2007). “Interest-Based Personalized Search,” ACM Transactions on Information Systems, 25(1).
[18] Jansen, B. J. and Spink, A. (2005). “An Analysis of Web Searching by European AlltheWeb.com Users,” Information Processing and Management, 41, pp. 361-381.
[19] Ma, Z., Liu Sheng, O. R., and Pant, G. (2005). “Evaluation of Ontology-Based User Interests Modeling,” in Proceedings of the 4th Workshop on e-Business.
[20] Liu, F., Yu, C., and Meng, W. (2004). “Personalized Web Search for Improving Retrieval Effectiveness,” IEEE Transactions on Knowledge and Data Engineering, 16(1), pp. 28-40.
[21] Fürnkranz, J. (1998). “A Study Using n-gram Features for Text Categorization,” Technical Report OEFAI-TR-98-30, Austrian Research Institute for Artificial Intelligence.
[22] Tan, P.-N., Steinbach, M., and Kumar, V. (2005). Introduction to Data Mining. Addison-Wesley. ISBN 0-321-32136-7.