From User Profiling to Corpus Profiling
Nikolaos Nanas
CE.RE.TE.TH. (www.cereteth.gr)
CEntre for REsearch and TEchnology THessaly, Greece
Imtronics Institute
LISyS Lab for Information Systems and Services
www.lisys.gr
profiling TIPSTER with Nootropia
outline
• Corpus Profiling
  • the need
  • the requirements
• User Profiling
  • Information Filtering
  • problem casting
• Nootropia
  • the model
  • construction process
  • feature extraction
• Profiling TIPSTER
  • the dataset
  • experiments
• Future Work
“[...] any two [optimization] algorithms are equivalent when their performance is averaged across all possible problems.”
“no free lunch theorem”
no Free Lunch
Wolpert and Macready
“[...] improvement of performance in problem-solving hinges on using prior information to match procedures to problems.”
document collections
buying lunch
• characterise corpora
• map methods to characteristics
• compare corpora
CP requirements
• automatic
• objective
• quantitative
• continuous
Information Filtering (IF) is concerned with the problem of providing a user with relevant information, based on a tailored representation of the userʼs interests, called “profile”.
Content-based document filtering, in particular, deals with text documents and in this case the user profile is an abstraction directly derived from their content.
Scene from Terry Gilliam’s “Brazil”
Information Filtering
the User Profile
[Figure: USER, USER PROFILE, User Feedback]
A User Profiling model can be used to construct a profile out of the documents in a corpus.
• mine the profile
• characterise corpus
• compare corpora
build a profile for each document collection
from User to Corpus
A User Profile is built out of features that are automatically extracted from the content of information items. In the case of text, features correspond to words or more generally terms.
• content characterisation
• dimensionality reduction
Features are typically keywords extracted from text.
feature extraction
IF is traditionally approached with user profiling models that treat documents as “bags of words” and ignore any syntactic or semantic correlations between terms.
bag of words
According to Salton's Vector Space Model, both documents and queries (and traditionally profiles as well) are represented as vectors in a multidimensional space, with as many dimensions as there are unique words in the indexed document collection.
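The bag-of-words view can be made concrete with a minimal sketch. It assumes whitespace tokenisation and raw term frequencies as weights, which are illustrative choices only; the function names `build_vocabulary` and `to_vector` are hypothetical. Two documents that contain the same words in a different order end up with identical vectors, which is exactly the information a bag of words discards.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of the unique words in the collection (one dimension each)."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_vector(document, vocabulary):
    """Term-frequency vector of one document over the shared vocabulary.

    Raw counts are an assumption made for illustration; any term
    weighting scheme could be substituted here.
    """
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


docs = [
    "user profile filters documents",
    "documents filters user profile",  # same words, different order
    "corpus profile",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]

# Word order is lost: the first two documents are indistinguishable.
print(vectors[0] == vectors[1])
```

The vocabulary has one dimension per unique word, so its size grows with the collection, which is why feature extraction is also a dimensionality reduction step.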
a bag of words
Vector Space
• A weighted keyword vector can only provide us with information about differences in the distribution of words in text.
• It treats a document collection as a big bag containing small bags of words.
The Corpus Profile should be a rich information structure.
• weighted feature network
• non-linear document evaluation
• documents are not just bags of words
nootropia
Nootropia is an immune-inspired, user profiling model for adaptive information filtering.
the user profile
[Figure: the profile as a network of weighted terms connected by weighted links]
the user profile
the profile is built in three steps...
[Figure panels: relevant documents, weighted terms, connected terms, ordered terms]
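The three construction steps (weight terms, connect terms, order terms) can be sketched roughly as follows. This is not Nootropia's actual method: the slides do not give the weighting formulas, so relative document frequency for terms and normalised co-occurrence counts for links are placeholders, and `build_profile`, `num_terms` and `window` are hypothetical names. The sketch only shows the shape of the resulting structure: weighted terms, weighted links between terms that co-occur within a sliding window, and an ordering of terms by decreasing weight.

```python
from collections import Counter


def build_profile(documents, num_terms=3, window=2):
    """Build a toy term network from a list of relevant documents."""
    tokens = [doc.lower().split() for doc in documents]

    # Step 1: weight terms and keep the top `num_terms`.
    # Placeholder weighting: fraction of documents containing the term.
    doc_freq = Counter(word for doc in tokens for word in set(doc))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))[:num_terms]
    term_weights = {term: n / len(tokens) for term, n in ranked}

    # Step 2: link extracted terms that co-occur within `window` tokens.
    link_counts = Counter()
    for doc in tokens:
        for i, first in enumerate(doc):
            for second in doc[i + 1 : i + window]:
                if (first != second and first in term_weights
                        and second in term_weights):
                    link_counts[frozenset((first, second))] += 1
    total = sum(link_counts.values())
    # Placeholder link weighting: share of all counted co-occurrences.
    link_weights = {pair: n / total for pair, n in link_counts.items()}

    # Step 3: order terms by decreasing weight (ties broken alphabetically).
    ordered = sorted(term_weights, key=lambda t: (-term_weights[t], t))
    return term_weights, link_weights, ordered


docs = [
    "information filtering profile",
    "user profile information",
    "information retrieval",
]
terms, links, ordered = build_profile(docs, num_terms=3, window=2)
print(ordered)
```

Unlike a keyword vector, the returned structure keeps which terms appear near each other, which is what allows a profile to capture term dependencies.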
the user profile
the profile’s structure provides additional information
[Figure: the profile as a network of weighted terms connected by weighted links]
network statistics
General Features
• number of terms
• number of links
• links per term
• maximum term weight
• maximum link weight
• average term weight
• average link weight

Dominant Features
• number of dominants
• average dom. weight
• average links per dom.
• av. number of actives
• av. number of leafs
• ov. weight of actives / dom
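As a rough illustration, the general features can be computed from any weighted term network. The sketch assumes the network is held in two dictionaries, `term_weights` (term to weight) and `link_weights` (unordered term pair to weight); both names and this representation are hypothetical. "Links per term" is taken to count each link at both of its endpoints, an assumption that agrees with the reported TIPSTER figures (for DOE, 2 × 39340 / 300 ≈ 262.27). The dominant features are left out because the slides do not define how dominant terms are identified.

```python
def general_statistics(term_weights, link_weights):
    """General network features of a profile (toy representation)."""
    n_terms = len(term_weights)
    n_links = len(link_weights)
    return {
        "number of terms": n_terms,
        "number of links": n_links,
        # Each link touches two terms, so it is counted twice here.
        "links per term": 2 * n_links / n_terms if n_terms else 0.0,
        "maximum term weight": max(term_weights.values(), default=0.0),
        "maximum link weight": max(link_weights.values(), default=0.0),
        "average term weight": (sum(term_weights.values()) / n_terms
                                if n_terms else 0.0),
        "average link weight": (sum(link_weights.values()) / n_links
                                if n_links else 0.0),
    }


# Toy network with made-up weights, for illustration only.
term_weights = {"a": 0.5, "b": 0.3, "c": 0.2}
link_weights = {frozenset(("a", "b")): 0.4, frozenset(("b", "c")): 0.1}
stats = general_statistics(term_weights, link_weights)
for name, value in stats.items():
    print(f"{name}: {value}")
```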
TIPSTER collection
includes:
• Short abstracts from the Department of Energy (DOE).
• News stories from the Associated Press (AP).
• Whole issues of the Federal Register (FR).
• News stories from the San Jose Mercury News (SJM).
• News stories from the Wall Street Journal (WSJ).
• Material from Ziff-Davis Publishing Co. (ZF).
• U.S. Patent documents (PAT).
profiling TIPSTER
[Figure panels: DOE, AP, FR, PAT]
experimental settings

window size   feature extraction   link threshold   aggregate correlation
5             300t                 3                0.5827
5             thr                  3                0.7654
10            300t                 3                0.705
10            thr                  3                0.8343
5             300t                 50               0.608
5             thr                  50               0.7576
10            300t                 50               0.6114
10            thr                  50               0.7562
5             300t                 100              0.6051
5             thr                  100              0.7405
10            300t                 100              0.567
10            thr                  100              0.7262
5             300t                 200              0.619
5             thr                  200              0.7088
10            300t                 200              0.618
10            thr                  200              0.69
general statistics
       number     number     links      maximum       maximum       average       average
       of terms   of links   per term   term weight   link weight   term weight   link weight
DOE    300        39340      262.27     0.19967       +2.9281E-05   0.04138       +5.5180E-07
AP     300        4881       32.54      0.40588       +1.3930E-05   0.09933       +4.2467E-07
FR     300        17900      119.33     0.9238        +2.9667E-05   0.22138       +5.1248E-07
SJM    300        30563      203.75     0.99979       +5.9866E-05   0.11613       +5.1458E-07
WSJ    300        30122      200.81     0.59734       +6.2984E-05   0.11184       +5.4529E-07
ZF     300        5001       33.34      0.70814       +2.9681E-05   0.12127       +6.1881E-07
PAT    300        34402      229.35     0.92313       +2.1159E-04   0.39544       +7.5657E-07
dominant statistics
       number         average       average       av. number of    av. number   av. number   ov. act. weight
       of dominants   dom. weight   link weight   links per dom.   of actives   of leafs     per dominant
DOE    1              0.19967       +2.8597E-07   298              300          2            12.41372
AP     14             0.25897       +2.4710E-07   20.79            261.29       7            22.41597
FR     1              0.9238        +7.0922E-07   45               300          4            66.41486
SJM    1              0.99979       +8.3597E-07   27               300          1            34.83854
WSJ    2              0.53216       +4.0246E-07   151              297.5        2            32.28973
ZF     6              0.42194       +3.6498E-07   37.33            272.33       10           27.53403
PAT    2              0.90349       +4.5361E-07   219              298          2            116.85551
dominant terms

corpus   term         weight   number     number       number     average
                               of links   of actives   of leafs   act. weight
DOE      results      0.1997   289        300          2          0.0414
AP       est          0.4059   4          287          6          0.0926
AP       year         0.3807   32         287          6          0.0925
AP       press        0.3797   27         287          6          0.0925
AP       edt          0.356    5          286          6          0.0914
AP       people       0.3051   24         286          6          0.0912
AP       years        0.2787   18         279          6          0.0888
AP       president    0.259    29         275          6          0.0877
AP       government   0.2434   33         275          6          0.0876
AP       state        0.2215   34         275          6          0.0875
AP       made         0.1944   11         259          6          0.0826
AP       states       0.1877   17         259          6          0.0826
AP       eds          0.1748   21         257          6          0.0819
AP       high         0.1321   14         222          6          0.0748
AP       west         0.1195   14         180          6          0.0694
FR       code         0.9238   24         300          3          0.2214
SJM      eng          0.9998   6          300          1          0.1161
WSJ      wall         0.5973   74         299          3          0.1107
WSJ      million      0.467    180        295          3          0.1063
ZF       copyright    0.7081   26         292          11         0.1151
ZF       topic        0.6775   50         283          11         0.1048
ZF       company      0.4938   26         278          11         0.0991
ZF       system       0.3818   23         274          11         0.0954
ZF       compatible   0.276    54         273          11         0.0945
ZF       users        0.264    31         269          11         0.0927
ZF       feature      0.2638   25         269          11         0.0927
ZF       user         0.2372   20         238          11         0.0817
ZF       includes     0.1694   34         236          11         0.0807
PAT      invention    0.9231   203        299          2          0.3938
PAT      claim        0.8838   166        294          2          0.3859
cosine similarity
       FR      PAT     ZF      WSJ      DOE      AP      SJM
FR     1       0.89    0.856   0.903    0.691    0.723   0.957
PAT    0.89    1       0.727   0.921    0.78     0.621   0.867
ZF     0.856   0.727   1       0.778    0.603    0.912   0.755
WSJ    0.903   0.921   0.778   1        0.921    0.693   0.932
DOE    0.691   0.78    0.603   0.921    1        0.55    0.749
AP     0.723   0.621   0.912   0.693    0.55     1       0.638
SJM    0.957   0.867   0.755   0.9325   0.7495   0.638   1
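For reference, cosine similarity between two feature vectors can be computed as in the sketch below. The slides do not state which vectors were compared to obtain the matrix above, so the inputs here are made-up toy vectors and `cosine_similarity` is a hypothetical helper, not the code used in the experiments.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy profile feature vectors (illustrative values, not TIPSTER data).
profile_a = [1.0, 2.0, 3.0]
profile_b = [2.0, 4.0, 6.0]   # same direction as profile_a
profile_c = [3.0, 0.0, -1.0]  # orthogonal to profile_a

print(cosine_similarity(profile_a, profile_b))
print(cosine_similarity(profile_a, profile_c))
```

The measure is symmetric and equals 1 for a vector compared with itself, which is why a matrix of pairwise similarities has a unit diagonal and mirrored off-diagonal entries.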
summary & future directions
• from User to Corpus Profiling
• Nootropia
  • captures term dependencies
  • rich information structure
  • profile can be mined
• Corpus Profiling
  • identify and extract informative features
  • characterise corpora
  • compare corpora
• map methods to features
• assess validity
• produce a continuous corpus profiling mechanism
Thank you
[email protected]
questions