From User Profiling to Corpus Profilingkmi.open.ac.uk/.../invited/CoprusProfilingWorkshop-Nanas.ppt · From User Profiling to Corpus Profiling Nikolaos Nanas CE.RE.TE.TH. () CEntre

From User Profiling to Corpus Profiling

Nikolaos Nanas
CE.RE.TE.TH. (www.cereteth.gr)
CEntre for REsearch and TEchnology THessaly, Greece
Imtronics Institute

LISyS Lab for Information Systems and Services
www.lisys.gr

profiling TIPSTER with Nootropia

outline

• Corpus Profiling
  • the need
  • the requirements
• User Profiling
  • Information Filtering
  • problem casting
• Nootropia
  • the model
  • construction process
  • feature extraction
• Profiling TIPSTER
  • the dataset
  • experiments
• Future Work

“[...] any two [optimization] algorithms are equivalent when their performance is averaged across all possible problems.”

“no free lunch theorem”

no Free Lunch

Wolpert and Macready

“[...] improvement of performance in problem-solving hinges on using prior information to match procedures to problems.”

document collections

buying lunch

• characterise corpora
• map methods to characteristics
• compare corpora

CP requirements

• automatic
• objective
• quantitative
• continuous

Information Filtering (IF) is concerned with the problem of providing a user with relevant information, based on a tailored representation of the userʼs interests, called “profile”.

Content-based document filtering, in particular, deals with text documents and in this case the user profile is an abstraction directly derived from their content.

Scene from Terry Gilliam’s “Brazil”

Information Filtering

the User Profile

[Figure: the USER, the USER PROFILE, and User Feedback]

A User Profiling model can be used to construct a profile out of the documents in a corpus.

• mine the profile
• characterise corpus
• compare corpora

build a profile for each document collection

from User to Corpus

A User Profile is built out of features that are automatically extracted from the content of information items. In the case of text, features correspond to words or more generally terms.

• content characterisation
• dimensionality reduction

Features are typically keywords extracted from text.
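As a rough illustration of keyword extraction as dimensionality reduction, the sketch below weights every term in a small set of texts and keeps only a subset. TF-IDF is used purely as a stand-in weighting (the slides do not name one), and the function names, the sample texts, and the two selection rules (keep the top k terms, or keep terms above a threshold) are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_weights(documents):
    """Weight each term by collection frequency times inverse document frequency.

    TF-IDF is an assumed placeholder; the slides do not specify a weighting.
    """
    term_freq = Counter()
    doc_freq = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    return {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}


def top_k_terms(weights, k):
    """Keep the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


def terms_above(weights, threshold):
    """Keep every term whose weight exceeds the threshold."""
    return {t: w for t, w in weights.items() if w > threshold}


# Hypothetical sample texts, not taken from any corpus in the slides.
docs = [
    "the patent claims a new invention",
    "the invention relates to a filter",
    "the press reported the news",
]
weights = tfidf_weights(docs)
print(top_k_terms(weights, 3))
print(sorted(terms_above(weights, 1.0)))
```

Either rule shrinks the vocabulary to a fixed or variable number of features; a term that occurs in every text, such as "the" here, receives weight zero and is dropped by both.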

feature extraction

IF is traditionally approached with user profiling models that treat documents as “bags of words” and ignore any syntactic or semantic correlations between terms.

bag of words

According to Saltonʼs Vector Space Model, both documents and queries (and traditionally profiles as well) are represented as vectors in a multidimensional space, with as many dimensions as there are unique words in the indexed document collection.
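A minimal sketch of this representation, assuming raw term counts as vector components and cosine similarity as the matching function (both common choices, neither stated on the slide). The sample strings are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical profile and documents.
profile = bag_of_words("information filtering user profile")
doc_a = bag_of_words("adaptive information filtering")
doc_b = bag_of_words("filtering information adaptive")  # same words, new order
doc_c = bag_of_words("wall street journal")

print(cosine(profile, doc_a))
print(cosine(profile, doc_b))
print(cosine(profile, doc_c))
```

Because `doc_a` and `doc_b` contain the same words, they map to the same vector and score identically against the profile: word order and any relation between terms are invisible to this representation.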

Vector Space

• A weighted keyword vector can only provide us with information about differences in the distribution of words in text.

• It treats a document collection as a big bag containing small bags of words.

The Corpus Profile should be a rich information structure.

• weighted feature network
• non-linear document evaluation
• documents are not just bags of words

nootropia

Nootropia is an immune-inspired user profiling model for adaptive information filtering.

the user profile

[Figure: the profile as a network of weighted terms joined by weighted links]

the user profile

the profile is built in three steps...

[Figure: relevant documents → weighted terms → connected terms → ordered terms]
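The three steps (weight terms, connect terms, order terms) can be sketched as follows. This is an illustration under stated assumptions, not Nootropia's actual procedure: relative frequency stands in for the term weighting, co-occurrence counts within a sliding window stand in for the link weighting, and `build_profile`, `n_terms`, and `window` are names invented for the example.

```python
from collections import Counter


def build_profile(documents, n_terms=5, window=3):
    """Sketch of the three steps: weight terms, connect them, order them."""
    token_lists = [doc.lower().split() for doc in documents]

    # Step 1: weight terms and keep the n_terms heaviest.
    # Relative frequency is an assumed placeholder weight.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]
    term_weights = {term: count / total for term, count in ranked}

    # Step 2: link profile terms that co-occur within a span of `window` tokens.
    # Raw co-occurrence counts are an assumed placeholder link weight.
    link_weights = Counter()
    for tokens in token_lists:
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + window]:
                if left != right and left in term_weights and right in term_weights:
                    link_weights[frozenset((left, right))] += 1

    # Step 3: order terms by decreasing weight (ties broken alphabetically).
    ordered = sorted(term_weights, key=lambda t: (-term_weights[t], t))
    return term_weights, dict(link_weights), ordered


# Hypothetical relevant documents.
relevant = [
    "information filtering adapts to user interests",
    "user profile for information filtering",
]
terms, links, ordered = build_profile(relevant)
print(ordered)
print(links[frozenset(("information", "filtering"))])
```

The result is a network rather than a flat vector: the ordering in step 3 gives it a hierarchy, and the links record which terms tend to appear close together.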

the user profile

the profile’s structure provides additional information

[Figure: network of weighted terms and links]

network statistics

General Features
• number of terms
• number of links
• links per term
• maximum term weight
• maximum link weight
• average term weight
• average link weight

Dominant Features
• number of dominants
• average dom. weight
• average links per dom.
• av. number of actives
• av. number of leafs
• ov. weight of actives / dom.
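The general features can be read directly off a term network. The sketch below assumes the profile is held as two dictionaries (term to weight, term pair to link weight) and counts links per term as two endpoints per link; both are assumptions for this example, and the toy weights are invented. The dominant features are left out because the slides do not define dominants, actives, or leafs.

```python
def general_statistics(term_weights, link_weights):
    """Compute the general features of a weighted term network.

    term_weights: dict mapping term -> weight
    link_weights: dict mapping (term, term) pair -> weight
    """
    n_terms = len(term_weights)
    n_links = len(link_weights)
    return {
        "number of terms": n_terms,
        "number of links": n_links,
        # Each link touches two terms (assumed definition).
        "links per term": 2 * n_links / n_terms,
        "maximum term weight": max(term_weights.values()),
        "maximum link weight": max(link_weights.values()),
        "average term weight": sum(term_weights.values()) / n_terms,
        "average link weight": sum(link_weights.values()) / n_links,
    }


# Hypothetical toy profile, not taken from the TIPSTER results.
toy_terms = {"invention": 0.9, "claim": 0.8, "device": 0.3}
toy_links = {("invention", "claim"): 0.5, ("claim", "device"): 0.1}

stats = general_statistics(toy_terms, toy_links)
for name, value in stats.items():
    print(f"{name}: {value}")
```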

TIPSTER collection

includes:

• Short abstracts from the Department of Energy (DOE).
• News stories from the Associated Press (AP).
• Whole issues of the Federal Register (FR).
• News stories from the San Jose Mercury News (SJM).
• News stories from the Wall Street Journal (WSJ).
• Material from Ziff-Davis Publishing Co. (ZF).
• U.S. Patent documents (PAT).

profiling TIPSTER

[Figure panels: DOE, AP, FR, PAT]

experimental settings

window size  feature extraction  link threshold  aggregate correlation
5            300t                3               0.5827
5            thr                 3               0.7654
10           300t                3               0.705
10           thr                 3               0.8343
5            300t                50              0.608
5            thr                 50              0.7576
10           300t                50              0.6114
10           thr                 50              0.7562
5            300t                100             0.6051
5            thr                 100             0.7405
10           300t                100             0.567
10           thr                 100             0.7262
5            300t                200             0.619
5            thr                 200             0.7088
10           300t                200             0.618
10           thr                 200             0.69
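To compare the two feature extraction settings, the runs can be grouped by their label and paired by window size and link threshold. The sketch below only rearranges the figures from the settings table; it assumes each run is one (window size, feature extraction, link threshold, aggregate correlation) row, and it makes no claim about what "300t" and "thr" stand for. The helper names are invented for the example.

```python
from statistics import mean

# (window size, feature extraction, link threshold, aggregate correlation),
# transcribed from the experimental settings table.
RUNS = [
    (5, "300t", 3, 0.5827), (5, "thr", 3, 0.7654),
    (10, "300t", 3, 0.705), (10, "thr", 3, 0.8343),
    (5, "300t", 50, 0.608), (5, "thr", 50, 0.7576),
    (10, "300t", 50, 0.6114), (10, "thr", 50, 0.7562),
    (5, "300t", 100, 0.6051), (5, "thr", 100, 0.7405),
    (10, "300t", 100, 0.567), (10, "thr", 100, 0.7262),
    (5, "300t", 200, 0.619), (5, "thr", 200, 0.7088),
    (10, "300t", 200, 0.618), (10, "thr", 200, 0.69),
]


def mean_by_extraction(runs):
    """Average aggregate correlation for each feature extraction label."""
    groups = {}
    for _window, extraction, _threshold, correlation in runs:
        groups.setdefault(extraction, []).append(correlation)
    return {name: mean(values) for name, values in groups.items()}


def paired_gains(runs):
    """Difference (thr minus 300t) for each (window size, link threshold) pair."""
    by_key = {(w, t, e): c for w, e, t, c in runs}
    return {
        (w, t): by_key[(w, t, "thr")] - by_key[(w, t, "300t")]
        for (w, t, e) in by_key
        if e == "thr"
    }


means = mean_by_extraction(RUNS)
gains = paired_gains(RUNS)
print(means)
print(min(gains.values()), max(gains.values()))
```

On these figures the "thr" runs score higher than the matching "300t" runs in all eight pairings.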

general statistics

corpus  number    number    links     maximum      maximum      average      average
        of terms  of links  per term  term weight  link weight  term weight  link weight
DOE     300       39340     262.27    0.19967      +2.9281E-05  0.04138      +5.5180E-07
AP      300       4881      32.54     0.40588      +1.3930E-05  0.09933      +4.2467E-07
FR      300       17900     119.33    0.9238       +2.9667E-05  0.22138      +5.1248E-07
SJM     300       30563     203.75    0.99979      +5.9866E-05  0.11613      +5.1458E-07
WSJ     300       30122     200.81    0.59734      +6.2984E-05  0.11184      +5.4529E-07
ZF      300       5001      33.34     0.70814      +2.9681E-05  0.12127      +6.1881E-07
PAT     300       34402     229.35    0.92313      +2.1159E-04  0.39544      +7.5657E-07
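The reported links-per-term values are consistent with counting both endpoints of every link, that is, twice the number of links divided by the number of terms. This reading is inferred from the numbers rather than stated on the slide; the short check below recomputes the column for all seven corpora.

```python
# (corpus, number of terms, number of links, reported links per term),
# transcribed from the general statistics table.
GENERAL = [
    ("DOE", 300, 39340, 262.27),
    ("AP", 300, 4881, 32.54),
    ("FR", 300, 17900, 119.33),
    ("SJM", 300, 30563, 203.75),
    ("WSJ", 300, 30122, 200.81),
    ("ZF", 300, 5001, 33.34),
    ("PAT", 300, 34402, 229.35),
]


def links_per_term(n_terms, n_links):
    """Average number of link endpoints per term (two endpoints per link)."""
    return 2 * n_links / n_terms


for corpus, n_terms, n_links, reported in GENERAL:
    computed = links_per_term(n_terms, n_links)
    # Reported values are rounded to two decimals.
    agrees = abs(computed - reported) < 0.005
    print(f"{corpus}: computed {computed:.2f}, reported {reported}, agrees: {agrees}")
```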

dominant statistics

corpus  number        average      average      av. number of   av. number  av. number  ov. act. weight
        of dominants  dom. weight  link weight  links per dom.  of actives  of leafs    per dominant
DOE     1             0.19967      +2.8597E-07  298             300         2           12.41372
AP      14            0.25897      +2.4710E-07  20.79           261.29      7           22.41597
FR      1             0.9238       +7.0922E-07  45              300         4           66.41486
SJM     1             0.99979      +8.3597E-07  27              300         1           34.83854
WSJ     2             0.53216      +4.0246E-07  151             297.5       2           32.28973
ZF      6             0.42194      +3.6498E-07  37.33           272.33      10          27.53403
PAT     2             0.90349      +4.5361E-07  219             298         2           116.85551

dominant terms

corpus  term        weight  number    number      number    average
                            of links  of actives  of leafs  act. weight
DOE     results     0.1997  289       300         2         0.0414
AP      est         0.4059  4         287         6         0.0926
AP      year        0.3807  32        287         6         0.0925
AP      press       0.3797  27        287         6         0.0925
AP      edt         0.356   5         286         6         0.0914
AP      people      0.3051  24        286         6         0.0912
AP      years       0.2787  18        279         6         0.0888
AP      president   0.259   29        275         6         0.0877
AP      government  0.2434  33        275         6         0.0876
AP      state       0.2215  34        275         6         0.0875
AP      made        0.1944  11        259         6         0.0826
AP      states      0.1877  17        259         6         0.0826
AP      eds         0.1748  21        257         6         0.0819
AP      high        0.1321  14        222         6         0.0748
AP      west        0.1195  14        180         6         0.0694
FR      code        0.9238  24        300         3         0.2214
SJM     eng         0.9998  6         300         1         0.1161
WSJ     wall        0.5973  74        299         3         0.1107
WSJ     million     0.467   180       295         3         0.1063
ZF      copyright   0.7081  26        292         11        0.1151
ZF      topic       0.6775  50        283         11        0.1048
ZF      company     0.4938  26        278         11        0.0991
ZF      system      0.3818  23        274         11        0.0954
ZF      compatible  0.276   54        273         11        0.0945
ZF      users       0.264   31        269         11        0.0927
ZF      feature     0.2638  25        269         11        0.0927
ZF      user        0.2372  20        238         11        0.0817
ZF      includes    0.1694  34        236         11        0.0807
PAT     invention   0.9231  203       299         2         0.3938
PAT     claim       0.8838  166       294         2         0.3859

cosine similarity

      FR     PAT    ZF     WSJ     DOE     AP     SJM
FR    1      0.89   0.856  0.903   0.691   0.723  0.957
PAT   0.89   1      0.727  0.921   0.78    0.621  0.867
ZF    0.856  0.727  1      0.778   0.603   0.912  0.755
WSJ   0.903  0.921  0.778  1       0.921   0.693  0.932
DOE   0.691  0.78   0.603  0.921   1       0.55   0.749
AP    0.723  0.621  0.912  0.693   0.55    1      0.638
SJM   0.957  0.867  0.755  0.9325  0.7495  0.638  1
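One way to read the similarity matrix is to rank the corpus pairs from most to least similar. The sketch below works only from the reported values; how the underlying profile vectors were built is not stated on the slide, and `ranked_pairs` is a helper name invented for this example.

```python
from itertools import combinations

CORPORA = ["FR", "PAT", "ZF", "WSJ", "DOE", "AP", "SJM"]

# Cosine similarities transcribed from the matrix, one row per corpus.
SIMILARITY = [
    [1, 0.89, 0.856, 0.903, 0.691, 0.723, 0.957],
    [0.89, 1, 0.727, 0.921, 0.78, 0.621, 0.867],
    [0.856, 0.727, 1, 0.778, 0.603, 0.912, 0.755],
    [0.903, 0.921, 0.778, 1, 0.921, 0.693, 0.932],
    [0.691, 0.78, 0.603, 0.921, 1, 0.55, 0.749],
    [0.723, 0.621, 0.912, 0.693, 0.55, 1, 0.638],
    [0.957, 0.867, 0.755, 0.9325, 0.7495, 0.638, 1],
]


def ranked_pairs(names, matrix):
    """Return (similarity, name_a, name_b) for every pair, most similar first.

    Only the upper triangle is read, so each pair appears once.
    """
    pairs = [
        (matrix[i][j], names[i], names[j])
        for i, j in combinations(range(len(names)), 2)
    ]
    return sorted(pairs, reverse=True)


pairs = ranked_pairs(CORPORA, SIMILARITY)
most_similar = pairs[0]
least_similar = pairs[-1]
print("most similar:", most_similar)
print("least similar:", least_similar)
```

On the reported values, FR and SJM form the closest pair (0.957) and DOE and AP the most distant (0.55).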

summary & future directions

• from User to Corpus Profiling
• Nootropia
  • captures term dependencies
  • rich information structure
  • profile can be mined
• Corpus Profiling
  • identify and extract informative features
  • characterise corpora
  • compare corpora
• Map methods to features
• assess validity
• produce a continuous corpus profiling mechanism

Thank you

questions
