Ontology Based Personalized Search

Zhang Tao
The University of Seoul (UOS)

Zhang Tao

Data Mining

Contents

1. Overview

2. Determining the content of documents

3. User Profiles

4. Improving Search Results

5. Conclusions and Future Work


Overview

Proposing a problem

With the exponentially growing amount of information available on the Internet, the task of retrieving documents of interest has become increasingly difficult.

People have two ways to find the data they are looking for: searching and browsing.

In terms of searching, about one half of all retrieved documents have been reported to be irrelevant. Why?

Conclusion: How can an effective personalization system be built?


The study of this paper

This paper studies ways to model a user’s interests as profiles and shows how these profiles can be deployed for more effective information retrieval and filtering.

A user profile is created over time by analyzing surfed pages.

This paper shows how the profiles can be used to achieve search performance improvements.

Introducing the OBIWAN project

The goal of OBIWAN is to investigate a novel content-based approach to distributed information retrieval.

Websites are clustered into regions.


The architecture is a hierarchy of regions.

The text classifier is a core component not only of the entire OBIWAN project, but also of the presented personalization method.

Related Work

Personalization is a broad field of very active ongoing research.

Applications include personalized access to certain resources and filtering/rating systems.

SmartPush is currently the only system to store profiles as concept hierarchies.


Determining the content of documents

Importance

User interests are inferred by analyzing the web pages the user visits.

For this purpose, it is necessary to determine the content of these surfed pages, that is, to characterize them.

A hierarchy of concepts

The ontology is based on a publicly accessible browsing hierarchy.

Each node is associated with a set of documents; all documents for a node are merged into a superdocument.

Documents as well as superdocuments are represented as weighted keyword vectors.


The keyword vector of a surfed page is compared with the keyword vectors associated with every node to calculate similarities.

The nodes with the top matching vectors are assumed to be most related to the content of the surfed page.
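The matching step can be sketched as follows. This is a minimal illustration, not the OBIWAN classifier itself: the slides do not name the similarity measure or the weighting scheme, so cosine similarity over raw term frequencies is an assumption, and the function names, the parameter `k`, and the toy ontology are hypothetical.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Build a term-frequency vector (assumed stand-in for the weighted keyword vectors)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse keyword vectors (assumed measure)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_categories(page_text, superdocuments, k=4):
    """Return the k nodes whose superdocument vectors best match the surfed page."""
    page_vec = keyword_vector(page_text)
    scores = {
        node: cosine_similarity(page_vec, keyword_vector(text))
        for node, text in superdocuments.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy ontology: node name -> merged text of the node's documents.
superdocuments = {
    "Sports/Soccer": "soccer goal league match striker goal",
    "Computers/Search": "search engine query ranking index retrieval",
    "Arts/Music": "guitar concert album band song",
}
print(top_categories("personalized search engine ranking", superdocuments, k=2))
```

In this toy run the search-related node shares three terms with the page and the other nodes share none, so it is ranked first.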


User Profiles

Introduction

User profiles store approximations of the interests of a given user.

User profiles have three features:

• hierarchically structured, and not just a list of keywords
• generated automatically, without explicit user feedback
• dynamic

Creation and Maintenance

Profiles are generated by analyzing the surfing behavior of a user. “Surfing behavior” here refers to the length of the visited pages and the time spent on them.


Four different combinations of time, length, and subject discriminators have been investigated.

In the following functions, time refers to the time a user spent on a given page, length refers to the length of the page, $\gamma(d, c_i)$ is the strength of the match between the content of document $d$ and category $c_i$, and $\Delta L(c_i)$ represents the adjustment of the interest $L$ in category $c_i$.

$$\Delta L(c_i) = \log\left(\frac{\text{time}}{\log \log \text{length}}\right) \cdot \gamma(d, c_i) \qquad (1)$$

$$\Delta L(c_i) = \log\left(\frac{\text{time}}{\log \text{length}}\right) \cdot \gamma(d, c_i) \qquad (2)$$
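A small sketch of the two adjustment functions shown here, read as log(time / log log length) times the match strength and log(time / log length) times the match strength. The units (seconds, characters), the function name, and the idea of adding each adjustment to a running per-category total are assumptions made for illustration; the slides do not specify them.

```python
import math


def interest_adjustment(time_spent, length, match, variant="loglog"):
    """Return the interest adjustment for one visited page and one category.

    time_spent: time on the page (seconds assumed)
    length:     page length (characters assumed)
    match:      strength of the match between the page and the category
    variant:    "loglog" -> log(time / log log length) * match
                "log"    -> log(time / log length) * match
    """
    if time_spent <= 0 or length <= 1:
        raise ValueError("time_spent must be > 0 and length must be > 1")
    if variant == "loglog":
        damped_length = math.log(math.log(length))
    elif variant == "log":
        damped_length = math.log(length)
    else:
        raise ValueError("variant must be 'loglog' or 'log'")
    if damped_length <= 0:
        raise ValueError("page too short for this variant")
    return math.log(time_spent / damped_length) * match


# Hypothetical profile update: accumulate adjustments per category.
profile = {}
for category, match in [("Computers/Search", 0.8), ("Arts/Music", 0.1)]:
    delta = interest_adjustment(time_spent=60, length=5000, match=match)
    profile[category] = profile.get(category, 0.0) + delta
print(profile)
```

Because the length is damped by one or two logarithms, even large differences in page length change the result only slightly, while the time spent and the match strength dominate.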


Profile Evaluation: Convergence

The evaluation of the user profiles consists of two parts:

• The first introduces a notion of convergence, with respect to which 16 actual user profiles are discussed.

• The second examines the relationship between the calculated user interests and the actual user interests.

Figure 1 shows a sample profile (adjustment function 2); it consists of roughly 75 non-zero categories.

Figure 2 shows the number of non-zero categories for five sample profiles with 100-150 categories, created using the same interest adjustment function.


On average, the profiles converge after roughly 320 pages, or 17 days of surfing. Table 1 summarizes the convergence properties.


Comparison with actual user interests

Although convergence is a desirable property, it does not measure the accuracy of the generated profiles.

The sixteen users were shown the top twenty subjects in their profiles in random order and asked how appropriately these inferred categories reflected their interests.

Table 2 shows the users' answers to these questions for the top 20 and top 10 categories, respectively.


Improving Search Results

A problem with search results

The wealth of information available on the web is too large for a user to survey.

The top-ranked search results, which are the only ones a user can realistically look at, are often not relevant to that user.

There are three common approaches to address this problem:

• Re-ranking: The algorithm applies a function to the ranking numbers that have been returned by the search engine.

• Filtering: A filtering system determines which documents in the result sets are relevant and which are not.

• Query Expansion: If a query is expanded with the user’s interests, the search results are likely to be more narrowly focused.


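Re-Ranking

Given a query, re-ranking is done by modifying the ranking that was returned by a publicly accessible search engine, ProFusion (www.profusion.com) in this case.

The idea is to characterize each of the returned documents by its categories and, by referring to the user profile, to determine how much the user is interested in these categories.

The following function is the adjustment function of the re-ranking method:

$$Q(d_j) = w(d_j) \cdot \left(0.5 + \frac{1}{4} \sum_{i=1}^{4} L(c_i) \cdot \gamma(d_j, c_i)\right)$$

Here $w(d_j)$ is the ranking number returned by the search engine for document $d_j$, $L(c_i)$ is the user's interest in category $c_i$, and $\gamma(d_j, c_i)$ is the strength of the match between $d_j$ and $c_i$.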
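A sketch of how such a re-ranking function might be applied, reading it as: new score = search engine score times (0.5 plus one quarter of the sum, over the document's four best-matching categories, of user interest times match strength). The data layout, the function names, and the assumption that interests and engine scores lie in [0, 1] with higher meaning better are all hypothetical.

```python
def rerank_score(engine_score, interests, matches):
    """Combine the engine score with profile interest over up to four categories.

    engine_score: score returned by the search engine (higher is better, assumed)
    interests:    dict mapping category -> user interest from the profile
    matches:      list of (category, match_strength) for the document
    """
    top_four = matches[:4]
    personal = sum(interests.get(cat, 0.0) * strength for cat, strength in top_four) / 4
    return engine_score * (0.5 + personal)


def rerank(results, interests):
    """Re-order (doc_id, engine_score, matches) tuples by the adjusted score."""
    scored = [
        (doc_id, rerank_score(score, interests, matches))
        for doc_id, score, matches in results
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical profile and result list.
interests = {"Computers/Search": 0.9, "Sports/Soccer": 0.1}
results = [
    ("doc_a", 0.80, [("Sports/Soccer", 0.9)]),
    ("doc_b", 0.70, [("Computers/Search", 0.8)]),
]
for doc_id, score in rerank(results, interests):
    print(doc_id, round(score, 3))
```

Here doc_b overtakes doc_a even though the engine ranked it lower, because its category matches a strong interest in the profile. The constant 0.5 keeps a document with no profile match from being scored zero.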


Evaluation

The results that have been produced by the different re-ranking systems must be evaluated.

The eleven-point precision average is the preferred measure. It evaluates ranking performance in terms of recall and precision:

Recall = (number of relevant items retrieved) / (number of relevant items in collection)

Precision = (number of relevant items retrieved) / (total number of items retrieved)
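The measure can be computed as in the sketch below. It assumes the common interpolated variant, in which precision at each of the eleven recall levels 0.0, 0.1, ..., 1.0 is the highest precision observed at that recall or above; the slides do not state which variant was used, and the function name and input format are hypothetical.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant):
    """Average interpolated precision over recall levels 0.0, 0.1, ..., 1.0.

    ranked_relevance: list of booleans, True where the item at that rank is relevant
    total_relevant:   number of relevant items in the collection
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")

    # After each retrieved item, record (relevant items so far, precision).
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits, hits / rank))

    interpolated = []
    for level in range(11):
        # Recall hits/total_relevant >= level/10, compared in integers.
        reachable = [p for h, p in points if h * 10 >= level * total_relevant]
        interpolated.append(max(reachable, default=0.0))
    return sum(interpolated) / len(interpolated)


# Hypothetical ranking: relevant items at ranks 1 and 3, two relevant items overall.
print(eleven_point_average_precision([True, False, True, False], total_relevant=2))
```

A ranking that places all relevant items first scores 1.0, and a ranking that retrieves none of them scores 0.0.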


Figure 3 shows the recall-precision graphs for one of the interest adjustment functions.

Figure 4 shows the results for the remaining set of 16 queries, which were evaluated using this function.


Filtering

To filter a set of result documents means to exclude some documents.

Filtering was done by applying thresholds to the above ranking functions to decide which documents were relevant and which were not.

Figures 5 and 6 show the performance of the filter for the training and the testing set, respectively.
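Threshold-based filtering can be sketched in a few lines. The threshold value, the function name, and the (document, score) format are hypothetical; the slides do not give the thresholds that were used, which would presumably be tuned on the training set.

```python
def filter_results(scored_docs, threshold):
    """Keep documents whose ranking score reaches the threshold; drop the rest."""
    return [(doc_id, score) for doc_id, score in scored_docs if score >= threshold]


# Hypothetical scores produced by a ranking function.
scored_docs = [("doc_b", 0.476), ("doc_a", 0.418), ("doc_c", 0.120)]
print(filter_results(scored_docs, threshold=0.4))
```

Raising the threshold excludes more documents, which tends to trade recall for precision.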


Conclusions and Future Work

Conclusion

The user profiles have been shown to converge and to reflect actual user interests quite well.

With the presented approach, the length of a surfed page can be neglected when the interest in a page is inferred.

Future work

Future work includes the integration of the system into a web browser.

Other areas of profile deployment are conceivable.