UOS
Ontology Based Personalized Search
Zhang Tao, The University of Seoul
Data Mining
Contents
1. Overview
2. Determining the content of documents
3. User Profiles
4. Improving Search Results
5. Conclusions and Future Work
Overview
Proposing a problem
With the exponentially growing amount of information available on the Internet, the task of retrieving documents of interest has become increasingly difficult.
People have two ways to find the data they are looking for: searching and browsing.
In terms of searching, about one half of all retrieved documents have been reported to be irrelevant. Why?
Question: what does an effective personalization system look like?
Overview
The study of this paper
This paper studies ways to model a user's interests and shows how these profiles can be deployed for more effective information retrieval and filtering.
A user profile is created over time by analyzing surfed pages.
This paper shows how the profiles can be used to achieve search performance improvements.
Introducing the OBIWAN project
The goal of OBIWAN is to investigate a novel content-based approach to distributed information retrieval.
Websites are clustered into regions.
Overview
The architecture is a hierarchy of regions.
The text classifier is a core component not only of the entire OBIWAN project, but also of the presented personalization method.
Related Work
Personalization is a broad field of very active ongoing research.
Applications include personalized access to certain resources and filtering/rating systems.
SmartPush is currently the only system to store profiles as concept hierarchies.
Determining the content of documents
Importance
User interests are inferred by analyzing the web pages the user visits.
For this purpose, it is necessary to determine the content of, or characterize, these surfed pages.
A hierarchy of concepts
This ontology is based on a publicly accessible browsing hierarchy.
Each node is associated with a set of documents; all documents for a node are merged into a superdocument.
Documents as well as superdocuments are represented as weighted keyword vectors.
Determining the content of documents
The keyword vector of a surfed page is compared with the keyword vectors associated with every node to calculate similarities.
The nodes with the top matching vectors are assumed to be most related to the content of the surfed page.
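The matching step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name the similarity measure or the weighting scheme, so cosine similarity over raw term counts is assumed here, and the category names, the sample texts, and the parameter `k` are hypothetical.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Build a keyword vector; raw term counts stand in for the weights (assumption)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse keyword vectors (assumed measure)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_categories(page_text, superdocuments, k=4):
    """Return the k (category, similarity) pairs that best match the page."""
    page = keyword_vector(page_text)
    scored = [
        (cosine(page, keyword_vector(doc)), name)
        for name, doc in superdocuments.items()
    ]
    # Highest similarity first; ties are broken by category name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:k]]


# Hypothetical superdocuments, one per ontology node.
superdocuments = {
    "Sports/Soccer": "soccer goal league match player goal",
    "Computers/Search": "search engine query ranking index query",
    "Arts/Music": "guitar concert album band song",
}
ranked = top_categories("query ranking for a search engine", superdocuments, k=2)
```

The top entry of `ranked` is the node whose superdocument shares the most weighted keywords with the page.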
User Profiles
Introduction
User profiles store approximations of the interests of a given user.
User profiles have three features:
• hierarchically structured, and not just a list of keywords
• generated automatically, without explicit user feedback
• dynamic
Creation and Maintenance
Profiles are generated by analyzing the surfing behavior of a user.
“Surfing behavior” here refers to the length of the visited pages and the time spent on them.
User Profiles
Four different combinations of time, length, and subject discriminators have been investigated.
In the following functions, time refers to the time a user spent on a given page, length refers to the length of the page, and γ(d, ci) is the strength of the match between the content of document d and category ci. ΔL(ci) is the adjustment to the interest L in category ci.
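$$\Delta L(c_i) = \log\left(\frac{\mathit{time}}{\log\log \mathit{length}}\right) \cdot \gamma(d, c_i) \qquad (1)$$
$$\Delta L(c_i) = \log\left(\frac{\mathit{time}}{\log \mathit{length}}\right) \cdot \gamma(d, c_i) \qquad (2)$$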
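A minimal Python sketch of the two adjustment functions on this slide, read as log(time / log log length) · γ and log(time / log length) · γ. The units (seconds, characters), the natural logarithm, the function names, and the additive profile update are assumptions made for illustration; the slides do not specify them.

```python
import math


def adjustment_loglog(time_spent, length, match):
    """Interest adjustment log(time / log log length) * gamma(d, c_i).

    Requires time_spent > 0 and length > e so that both logarithms are positive.
    """
    return math.log(time_spent / math.log(math.log(length))) * match


def adjustment_log(time_spent, length, match):
    """Interest adjustment log(time / log length) * gamma(d, c_i).

    Requires time_spent > 0 and length > 1.
    """
    return math.log(time_spent / math.log(length)) * match


def update_profile(profile, matches, time_spent, length, adjust=adjustment_log):
    """Add the adjustment for every matched category to the profile (assumed update rule)."""
    for category, match in matches.items():
        profile[category] = profile.get(category, 0.0) + adjust(time_spent, length, match)
    return profile


# Hypothetical visit: 60 seconds on a 1000-character page matching one category.
profile = update_profile({}, {"Computers/Search": 0.5}, time_spent=60, length=1000)
```

Because log log length grows much more slowly than log length, the first variant makes the adjustment depend less on page length and more on viewing time.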
User Profiles
Profile Evaluation: Convergence
The evaluation of the user profiles consists of two parts:
• A notion of convergence is introduced, with respect to which 16 actual user profiles are discussed.
• The relationship between the calculated user interests and the actual user interests is examined.
Figure 1 shows a sample profile (adjustment function 2); it consists of roughly 75 non-zero categories.
Figure 2 shows the numbers of non-zero categories for five sample profiles with 100-150 categories created using the same interest adjustment function.
User Profiles
On average, convergence is reached after roughly 320 pages, or 17 days of surfing. Table 1 summarizes the convergence properties.
User Profiles
Comparison with actual user interests
Although convergence is a desirable property, it does not measure the accuracy of the generated profiles.
The sixteen users were shown the top twenty subjects in their profiles in random order and asked how appropriately these inferred categories reflected their interests.
Table 2 summarizes the users' answers for the top 20 and top 10 categories, respectively.
Improving Search Results
A problem with search results
The amount of information available on the web is too large for a user to examine.
Many of the top-ranked documents that a user actually looks at are not relevant to that user.
There are three common approaches to address this problem:
• Re-ranking: The algorithms apply a function to the ranking values that have been returned by the search engine.
• Filtering: Filtering systems determine which documents in the result sets are relevant and which are not.
• Query expansion: If a query can be expanded with the user's interests, the search results are likely to be more narrowly focused.
Improving Search Results
Re-Ranking
Given a query, re-ranking modifies the ranking that was returned by a publicly accessible search engine, in this case ProFusion (www.profusion.com).
The idea is to characterize each of the returned documents and, by referring to the user profile, to determine how much the user is interested in the categories each document matches.
The following function is the adjustment function of the Re-ranking method.
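$$Q(d_j) = w(d_j) \cdot \left(0.5 + \frac{1}{4} \sum_{i=1}^{4} \pi(c_i) \cdot \gamma(d_j, c_i)\right)$$
Here w(dj) is the ranking value returned by the search engine, π(ci) is the user's interest in category ci taken from the profile, and γ(dj, ci) is the strength of the match between document dj and category ci.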
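The re-ranking step can be illustrated with a short Python sketch. It assumes the adjustment function is read as: the search engine's ranking value times (0.5 plus one quarter of the sum, over the four best-matching categories, of profile interest times match strength). The function names, the normalization of interests to the range 0 to 1, and the sample values are hypothetical.

```python
def rerank_score(w, interests, matches):
    """Personalized score: w * (0.5 + 1/4 * sum of interest * match over four categories).

    w         -- ranking value returned by the search engine (higher is better, assumed)
    interests -- profile: category -> interest, assumed normalized to [0, 1]
    matches   -- category -> match strength between the document and that category
    """
    # Use the four categories that match the document best.
    top_four = sorted(matches.items(), key=lambda item: -item[1])[:4]
    personal = sum(interests.get(cat, 0.0) * strength for cat, strength in top_four) / 4
    return w * (0.5 + personal)


def rerank(results, interests):
    """Sort (doc_id, w, matches) triples by personalized score, best first."""
    scored = [(rerank_score(w, interests, matches), doc) for doc, w, matches in results]
    scored.sort(key=lambda item: -item[0])
    return [doc for _, doc in scored]


# Hypothetical profile and search results.
interests = {"Arts/Music": 1.0, "Sports/Soccer": 0.0}
results = [
    ("page_a", 0.9, {"Sports/Soccer": 0.8}),
    ("page_b", 0.8, {"Arts/Music": 0.9}),
]
order = rerank(results, interests)
```

In this example the search engine prefers `page_a`, but the profile shows no interest in its category, so `page_b` moves to the top. The constant 0.5 keeps a document with no matching interest from being scored zero.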
Improving Search Results
Evaluation
The results produced by the different re-ranking systems must be evaluated.
The eleven-point precision average is the preferred measure.
It evaluates ranking performance in terms of recall and precision.
Recall = (number of relevant items retrieved) / (number of relevant items in the collection)
Precision = (number of relevant items retrieved) / (total number of items retrieved)
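As an illustration, the eleven-point precision average can be computed as below. The sketch assumes the common interpolated form, where precision at each recall level 0.0, 0.1, ..., 1.0 is the highest precision observed at that recall or beyond; the slides do not say which variant was used, and the function name and sample ranking are hypothetical.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant):
    """Average interpolated precision over the recall levels 0.0, 0.1, ..., 1.0.

    ranked_relevance -- list of booleans in rank order (True = relevant item)
    total_relevant   -- number of relevant items in the collection
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")

    # Record (recall, precision) each time a relevant item is retrieved.
    points = []
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    interpolated = []
    for step in range(11):
        level = step / 10
        # Interpolated precision: best precision at this recall level or higher.
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / 11


# Hypothetical ranking: relevant items at ranks 1 and 3, two relevant items in total.
score = eleven_point_average_precision([True, False, True, False], total_relevant=2)
```

A ranking that places all relevant items first scores 1.0; pushing relevant items further down lowers precision at the higher recall levels and therefore the average.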
Improving Search Results
Figure 3 shows the recall-precision graphs for one of the interest adjustment functions.
The remaining set of 16 queries was evaluated using this function; Figure 4 shows the results.
Improving Search Results
Filtering
To filter a set of result documents means to exclude some documents.
Filtering was done by applying thresholds to the ranking functions above to decide which documents were relevant and which were not.
Figures 5 and 6 show the performance of the filter for the training and the testing set, respectively.
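Threshold filtering can be sketched as follows. The threshold values, the scores, and the helper names are hypothetical; the slides do not state how the thresholds were chosen.

```python
def filter_results(scored_docs, threshold):
    """Keep documents whose personalized score reaches the threshold; drop the rest."""
    return [doc for doc, score in scored_docs if score >= threshold]


def precision_recall(kept, relevant):
    """Precision and recall of the kept documents against a set of relevant ones."""
    kept_relevant = len(set(kept) & relevant)
    precision = kept_relevant / len(kept) if kept else 0.0
    recall = kept_relevant / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical scores and relevance judgments.
scored_docs = [("page_a", 0.58), ("page_b", 0.45), ("page_c", 0.30)]
relevant = {"page_a", "page_c"}

loose = filter_results(scored_docs, threshold=0.4)
strict = filter_results(scored_docs, threshold=0.5)
```

Raising the threshold from 0.4 to 0.5 removes the irrelevant `page_b`, which raises precision, while the relevant `page_c` stays excluded in both cases, so recall does not improve.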
Conclusions and Future Work
Conclusion
The user profiles have been shown to converge and to reflect actual user interests quite well.
With the presented approach, the length of a surfed page can be neglected when the interest in a page is inferred.
Future work
Future work includes the integration of the system into a web browser.
Other areas of profile deployment are conceivable.