Machine Learning Methods for Text / Web Data Mining

Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University
E-mail: [email protected]

This material is available at http://scai.snu.ac.kr./~btzhang/
Overview

• Introduction
  • Web Information Retrieval
  • Machine Learning (ML)
  • ML Methods for Text/Web Data Mining
• Text/Web Data Analysis
  • Text Mining Using Helmholtz Machines
  • Web Mining Using Bayesian Networks
• Summary
  • Current and Future Work
Web Information Retrieval

[Diagram: text data passes through preprocessing and indexing, then feeds three systems: a text classification system, an information filtering system driven by a user profile, a question, and relevance feedback, and an information extraction / DB template-filling system that fills DB records (e.g., location, date) from the filtered data.]
Machine Learning

• Supervised Learning
  • Estimate an unknown mapping from known input-output pairs
  • Learn f_w from a training set D = {(x, y)} s.t. f_w(x) = y = f(x)
  • Classification: y is discrete
  • Regression: y is continuous
• Unsupervised Learning
  • Only input values are provided
  • Learn f_w from D = {(x)} s.t. f_w(x) = x
  • Density Estimation
  • Compression, Clustering
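To make the distinction concrete, here is a minimal sketch (not from the slides): the supervised case fits f_w to input-output pairs, while the unsupervised case only sees inputs. The linear model and the data values are illustrative assumptions.

```python
import numpy as np

# Supervised: learn f_w from D = {(x, y)} so that f_w(x) ~ y = f(x).
# Illustrative choice: a linear model fitted by least squares (regression).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])            # noisy targets, roughly y = x
A = np.c_[X, np.ones(len(X))]                 # add a bias column
w, *_ = np.linalg.lstsq(A, y, rcond=None)     # fit the weights w
print(A @ w)                                  # f_w(x) for each training input

# Unsupervised: only D = {(x)} is given; the goal is e.g. an
# autoencoder-style reconstruction f_w(x) ~ x, or clustering /
# density estimation over the x's alone.
```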
Machine Learning Methods

• Neural Networks
  • Multilayer Perceptrons (MLPs)
  • Self-Organizing Maps (SOMs)
  • Support Vector Machines (SVMs)
• Probabilistic Models
  • Bayesian Networks (BNs)
  • Helmholtz Machines (HMs)
  • Latent Variable Models (LVMs)
• Other Machine Learning Methods
  • Evolutionary Algorithms (EAs)
  • Reinforcement Learning (RL)
  • Boosting Algorithms
  • Decision Trees (DTs)
ML for Text/Web Data Mining

• Bayesian Networks for Text Classification
• Helmholtz Machines for Text Clustering/Categorization
• Latent Variable Models for Topic Word Extraction
• Boosted Learning for the TREC Filtering Task
• Evolutionary Learning for Web Document Retrieval
• Reinforcement Learning for Web Filtering Agents
• Bayesian Networks for Web Customer Data Mining
Preprocessing for Text Learning

From: [email protected]
Newsgroups: comp.graphics
Subject: Need specs on Apple QT

I need to the specs, or at least a very verbose interpretation of the specs, for QuickTime. Technical articles from magazines and references to books would be nice, too.

I also need the specs in a format usable on a Unix or MS-Dos system. I can't do much with the QuickTime stuff they have on..

Each document is reduced to a vector of word counts:
..., specs: 3, unix: 1, hockey: 0, computer: 0, space: 0, references: 1, quicktime: 2, graphics: 0, clinton: 0, car: 0, baseball: 0, ...
Text Mining: Data Sets

• Usenet Newsgroup Data
  • 20 categories
  • 1000 documents for each category
  • 20,000 documents in total
• TDT2 Corpus
  • Topic detection and tracking (TDT) corpus from NIST
  • Used 6,169 documents in experiments
Text Mining: Helmholtz Machine Architecture

[Chang and Zhang, 2000]

[Diagram: a two-layer network with latent nodes h_1 ... h_m connected to input nodes d_1 ... d_n by recognition weights (input to latent) and generative weights (latent to input).]

• Input nodes
  • Binary values
  • Represent the existence or absence of words in documents.
• Latent nodes
  • Binary values
  • Extract the underlying causal structure in the document set.
  • Capture correlations of the words in documents.

Recognition model:
P(h_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^{n} w_ij d_j))

Generative model:
P(d_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^{m} w_ij h_j))
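As a rough illustration of the two conditional distributions above, the sketch below samples the latent layer from the recognition model and reconstructs word probabilities from the generative model. All shapes, initializations, and variable names are assumptions for the example, not details from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: n binary word indicators, m latent nodes.
n, m = 1000, 20
rng = np.random.default_rng(0)
W_rec = rng.normal(scale=0.01, size=(m, n))   # recognition weights (d -> h)
b_rec = np.zeros(m)
W_gen = rng.normal(scale=0.01, size=(n, m))   # generative weights (h -> d)
b_gen = np.zeros(n)

d = rng.integers(0, 2, size=n)                # binary word-presence vector

# Recognition: P(h_i = 1 | d) = sigmoid(b_i + sum_j w_ij d_j)
p_h = sigmoid(b_rec + W_rec @ d)
h = (rng.random(m) < p_h).astype(int)         # sample binary latent states

# Generation: P(d_i = 1 | h) = sigmoid(b_i + sum_j w_ij h_j)
p_d = sigmoid(b_gen + W_gen @ h)
```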
Text Mining: Learning Helmholtz Machines

• Introduce a recognition network for estimation of a generative network.
• Wake-Sleep Algorithm
  • Train the recognition and generative models alternately.
  • Update the weights in the network iteratively by a simple local delta rule.

The log-likelihood of the documents d^(1), ..., d^(T) is lower-bounded through the recognition distribution Q over the latent states α:

Σ_{t=1}^{T} log P(d^(t) | θ) = Σ_{t=1}^{T} log Σ_α Q(α) P(α, d^(t) | θ) / Q(α)
                            ≥ Σ_{t=1}^{T} Σ_α Q(α) log [ P(α, d^(t) | θ) / Q(α) ]

Weight update (local delta rule):
w_ij^new = w_ij^old + Δw_ij,  where Δw_ij = γ s_i (s_j - P(s_j = 1))
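A minimal sketch of the alternating wake-sleep updates, reusing the layer layout and variables from the previous sketch. The pairing of phases to weight sets follows the standard wake-sleep description, and the crude top-level sample in the sleep phase is my own simplification, not code from the paper.

```python
def delta_update(W, b, s_pre, s_post, p_post, gamma=0.01):
    """Local delta rule: dw_ij = gamma * s_i * (s_j - p_j), with
    p_j = P(s_j = 1) predicted by the model being trained and
    s the sampled binary states. W maps the pre layer to the post layer."""
    W += gamma * np.outer(s_post - p_post, s_pre)
    b += gamma * (s_post - p_post)

# Wake phase: recognize h from a data vector d, then train the generative
# weights so that P(d | h) reproduces d.
h = (rng.random(m) < sigmoid(b_rec + W_rec @ d)).astype(float)
delta_update(W_gen, b_gen, h, d, sigmoid(b_gen + W_gen @ h))

# Sleep phase: fantasize (h', d') from the generative model, then train the
# recognition weights to recover h' from d'.
h_f = (rng.random(m) < 0.5).astype(float)     # crude sample of latent states
d_f = (rng.random(n) < sigmoid(b_gen + W_gen @ h_f)).astype(float)
delta_update(W_rec, b_rec, d_f, h_f, sigmoid(b_rec + W_rec @ d_f))
```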
Text Mining: Methods

• Text Categorization
  • Train a Helmholtz machine for each category.
  • Total N machines for N categories.
  • Once the N machines have been estimated, classification of a test document proceeds by estimating the likelihood of the document under each machine:

    ĉ = argmax_{c ∈ C} log P(d | c)

• Topic Words Extraction
  • For the entire document set, train a single Helmholtz machine.
  • After training, examine the weights of connections from a latent node to the input nodes, that is, to words.
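In code, the categorization rule is just an argmax over per-category likelihood estimates. The `log_likelihood` interface below is an assumed placeholder for whatever estimate the trained machine provides:

```python
def classify(doc, models):
    """c_hat = argmax_{c in C} log P(d | c): score the document under each
    category's trained model and return the best-scoring category.
    `models` maps a category name to a trained model object."""
    return max(models, key=lambda c: models[c].log_likelihood(doc))
```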
Text Mining: Categorization Results

• Usenet Newsgroup Data
  • 20 categories, 1000 documents for each category, 20,000 documents in total.
Text Mining: Topic Words Extraction Results

• TDT2 Corpus: 6,169 documents

Topic words extracted per latent node:

1: tabacco, smoking, gingrich, newt, trent, republicans, congressional, republicans, attorney, smokers, lawsuit, senate, cigarette, morris, nicotine
2: warplane, airline, saudi, gulf, wright, soldiers, yitzhak, tanks, stealth, sabah, stations, kurds, mordechai, separatist, governor
3: olympics, nagano, olympic, winter, medal, hockey, atheletes, cup, games, slalom, medals, bronze, skating, lillehammer, downhill
4: netanyahu, palestinian, arafat, israeli, yasser, kofi, annan, benjamin, palestinians, mideast, gaza, jerusalem, eu, paris, israel
5: india, pakistan, pakistani, delhi, hindu, vajpayee, nuclear, tests, atal, kashmir, indian, janata, bharatiya, islamabad, bihari
6: suharto, habibie, demonstrators, riots, indonesians, demonstrations, soeharto, resignation, jakarta, rioting, electoral, rallies, wiranto, unrest, megawati
7: imf, monetary, currencies, currency, rupiah, singapore, bailout, traders, markets, thailand, inflation, investors, fund, banks, baht
8: pope, cuba, cuban, embargo, castro, lifting, cubans, havana, alan, invasion, reserve, paul, output, vatican, freedom
Web Mining: Customer Analysis

• KDD-2000 Web Mining Competition
  • Data: 465 features over 1700 customers
    • Features include friend promotion rate, date visited, weight of items, price of house, discount rate, ...
    • Data was collected during Jan. 30 - March 30, 2000.
    • Friend promotion was started from Feb. 29 with a TV advertisement.
  • Aim: description of heavy/low spenders
Web Mining: Feature Selection

Features selected in various ways [Yang & Zhang, 2000]:

Discriminant Model:
V240 (Friend), V229 (OrderAverage), V304 (OrderShippingAmtMin), V368 (WeightAverage), V43 (HomeMarketValue), V377 (NumAccountTemplateViews), V11 (WhichDoYouWearMostFrequent), V13 (SendEmail), V17 (USState), V45 (VehicleLifeStyle), V68 (RetailActivity), V19 (Date)

Decision Tree:
V13 (SendEmail), V234 (OrderItemQuantitySum%HavingDiscountRange(5-10)), V237 (OrderItemQuantitySum%HavingDiscountRange(10+)), V240 (Friend), V243 (OrderLineQuantitySum), V245 (OrderLineQuantityMaximum), V304 (OrderShippingAmtMin), V324 (NumLegwearProductViews), V368 (WeightAverage), V374 (NumMainTemplateViews), V412 (NumReplenishableStockViews)

Decision Tree + Factor Analysis:
V368 (WeightAverage), V243 (OrderLineQuantitySum), V245 (OrderLineQuantityMaximum),
F1 = 0.94*V324 + 0.868*V374 + 0.898*V412,
F2 = 0.829*V234 + 0.857*V240,
F3 = -0.795*V237 + 0.778*V304
Web Mining: Bayesian Nets

• Bayesian network
  • DAG (Directed Acyclic Graph)
  • Expresses dependence relations between variables
  • Can use prior knowledge on the data (parameters)
  • Examples of conjugate priors: Dirichlet for multinomial data, Normal-Wishart for normal data

Example: for a network over A, B, C, D, E,
P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D)
Web Mining: Results

A Bayesian net for the KDD web data:

• V229 (Order-Average) and V240 (Friend) directly influence V312 (Target).
• V19 (Date) was influenced by V240 (Friend), reflecting the TV advertisement.
Summary

• We study machine learning methods, such as
  • Probabilistic neural networks
  • Evolutionary algorithms
  • Reinforcement learning
• Application areas include
  • Text mining
  • Web mining
  • Bioinformatics (not addressed in this talk)
• Recent work focuses on probabilistic graphical models for web/text/bio data mining, including
  • Bayesian networks
  • Helmholtz machines
  • Latent variable models
Bayesian Networks: Architecture

A Bayesian network represents the probabilistic relationships between the variables:

P(X) = Π_{i=1}^{n} P(X_i | pa_i),  where pa_i is the set of parent nodes of X_i.

[Diagram: example network with nodes B, G, M, L.]

Example (chain rule, then conditional independences from the graph):
P(L,B,G,M) = P(L) P(B|L) P(G|B,L) P(M|B,G,L)
           = P(L) P(B) P(G|B) P(M|B,L)
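A small sketch of computing a joint probability from this kind of factorization. The DAG structure matches the example above, but the conditional probability tables are made-up numbers for illustration:

```python
# Joint probability from a DAG factorization: P(X) = prod_i P(X_i | pa_i).
# CPTs give P(node = 1 | parent values); all numbers are illustrative.
parents = {"L": (), "B": (), "G": ("B",), "M": ("B", "L")}
cpt = {
    "L": {(): 0.3},
    "B": {(): 0.6},
    "G": {(0,): 0.2, (1,): 0.7},
    "M": {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9},
}

def p_node(x, value, pa_values):
    """P(x = value | parents = pa_values) for a binary node."""
    p1 = cpt[x][pa_values]
    return p1 if value == 1 else 1.0 - p1

def joint(assign):
    """P(L,B,G,M) = P(L) P(B) P(G|B) P(M|B,L) for one full assignment."""
    p = 1.0
    for x, pa in parents.items():
        p *= p_node(x, assign[x], tuple(assign[q] for q in pa))
    return p

print(joint({"L": 1, "B": 1, "G": 0, "M": 1}))   # 0.3 * 0.6 * 0.3 * 0.9
```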
Bayesian Networks: Applications in IR - A Simple BN for Text Classification

[Hwang & Zhang, 2000]

• The network structure represents the naive Bayes assumption: a class node C is the parent of the term nodes t_1, t_2, ..., t_8754.
• All nodes are binary.
  • C: document class
  • t_i: i-th term
Bayesian Networks: Experimental Setup

• Dataset
  • The acq dataset from Reuters-21578
  • 8754 terms were selected by TFIDF.
  • Training data: 8762 documents
  • Test data: 3009 documents
• Parametric Learning
  • Dirichlet prior assumptions for the network parameter distributions:

    p(θ_ij | S^h) = Dir(θ_ij | α_ij1, ..., α_ijr_i)

  • Parameter distributions are updated with training data:

    p(θ_ij | D, S^h) = Dir(θ_ij | α_ij1 + N_ij1, ..., α_ijr_i + N_ijr_i)
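The conjugate update above is just element-wise addition of observed counts to the Dirichlet hyperparameters. A tiny sketch with made-up numbers:

```python
import numpy as np

# prior:     p(theta | S^h)    = Dir(alpha_1, ..., alpha_r)
# posterior: p(theta | D, S^h) = Dir(alpha_1 + N_1, ..., alpha_r + N_r)
alpha = np.array([1.0, 1.0])      # symmetric prior over a binary variable
N = np.array([96.0, 4.0])         # counts observed in the training data
posterior = alpha + N

# Predictive probability of each value (posterior mean of theta):
theta_hat = posterior / posterior.sum()
print(theta_hat)                  # [0.9509..., 0.0490...]
```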
Bayesian Networks: Experimental Results

For training data (accuracy: 94.28%):

                    Precision (%)   Recall (%)
Positive examples       75.98          96.83
Negative examples       99.32          93.76

For test data (accuracy: 96.51%):

                    Precision (%)   Recall (%)
Positive examples       89.17          95.16
Negative examples       98.67          96.88
Latent Variable Models: Architecture

Latent variable model for topic words extraction and document clustering [Shin & Zhang, 2000]

[Diagram: latent topics z_k link the documents d_n (document clustering) and the words w_m (topic-words extraction).]

Maximize the log-likelihood

L = Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) log P(d_n, w_m)
  = Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) log Σ_{k=1}^{K} P(z_k) P(w_m | z_k) P(d_n | z_k)

• Update P(z_k), P(w_m | z_k), P(d_n | z_k)
• With the EM algorithm
Latent Variable Models: Learning

• EM (Expectation-Maximization) Algorithm
  • Algorithm to maximize the pre-defined log-likelihood
  • Iteration of E-step and M-step

E-step:

P(z_k | d_n, w_m) = P(z_k) P(d_n | z_k) P(w_m | z_k) / Σ_{k'=1}^{K} P(z_k') P(d_n | z_k') P(w_m | z_k')

M-step:

P(w_m | z_k) = Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m) / Σ_{m'=1}^{M} Σ_{n=1}^{N} n(d_n, w_m') P(z_k | d_n, w_m')

P(d_n | z_k) = Σ_{m=1}^{M} n(d_n, w_m) P(z_k | d_n, w_m) / Σ_{n'=1}^{N} Σ_{m=1}^{M} n(d_n', w_m) P(z_k | d_n', w_m)

P(z_k) = (1/R) Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m),  where R ≡ Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m)
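A compact EM sketch for this model (a PLSA-style decomposition). The vectorized layout and the random initialization are my own assumptions, but the E-step and M-step implement the formulas above:

```python
import numpy as np

def plsa_em(n_dw, K, iters=50, seed=0):
    """EM for P(d,w) = sum_k P(z_k) P(w|z_k) P(d|z_k).
    n_dw is the document-word count matrix (N docs x M words)."""
    rng = np.random.default_rng(seed)
    N, M = n_dw.shape
    p_z = np.full(K, 1.0 / K)
    p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_d_z = rng.random((K, N)); p_d_z /= p_d_z.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(z) P(d|z) P(w|z)
        post = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post /= post.sum(0, keepdims=True) + 1e-12        # shape K x N x M
        weighted = n_dw[None, :, :] * post                 # n(d,w) P(z|d,w)
        # M-step: renormalized expected counts
        p_w_z = weighted.sum(1); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_d_z = weighted.sum(2); p_d_z /= p_d_z.sum(1, keepdims=True)
        p_z = weighted.sum((1, 2)); p_z /= p_z.sum()       # divides by R
    return p_z, p_w_z, p_d_z
```

After training, the top entries of each row of `p_w_z` give the topic words for that latent node, and each document can be assigned to argmax_k P(d_n | z_k), matching the two uses described above.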
Latent Variable Models: Applications in IR - Experimental Setup

Topic words extraction and document clustering with a subset of the TREC-8 data

• TREC-8 ad hoc task data
  • Documents: DTDS, FR94, FT, FBIS, LATIMES
  • Topics: 401-450 (401, 434, 439, 450 used)
    • 401: Foreign Minorities, Germany
    • 434: Estonia, Economy
    • 439: Inventions, Scientific discovery
    • 450: King Hussein, Peace
Latent Variable Models: Applications in IR - Experimental Results

Document clustering results (each topic is assigned to the cluster z_k with maximum P(d_i | z_k)):

Topic (#Docs)    z1     z2     z3     z4     Precision   Recall
401 (300)        20     279    0      1      0.902       0.930
434 (347)        79     20     10     238    0.996       0.686
439 (219)        9      7      203    0      0.953       0.927
450 (293)        290    3      0      0      0.729       0.990

Extracted topic words (top 35 words with highest P(w_j | z_k)):

Cluster 1 (z1): jordan, peac, isreal, palestinian, king, isra, arab, meet, talk, husayn, agreem, presid, majesti, negoti, minist, visit, region, arafat, secur, peopl, east, washington, econom, sign, relat, jerusalem, rabin, syria, iraq, ...
Cluster 2 (z2): german, germani, mr, parti, year, foreign, people, countri, govern, asylum, polit, nation, law, minist, europ, state, immigr, democrat, wing, social, turkish, west, east, member, attack, ...
Cluster 3 (z3): research, technology, develop, mar, materi, system, nuclear, environment, electr, process, product, power, energi, countrol, japan, pollution, structur, chemic, plant, ...
Cluster 4 (z4): percent, estonia, bank, state, privat, russian, year, enterprise, trade, million, trade, estonian, econom, countri, govern, compani, foreign, baltic, polish, loan, invest, fund, product, ...
Boosting: Algorithms

• A general method of converting rough rules into a highly accurate prediction rule
• Learning procedure
  • Examine the training set
  • Derive a rough rule (weak learner)
  • Re-weight the examples in the training set, concentrating on the hard cases for previous rules
  • Repeat T times

[Diagram: importance weights of training documents feed a sequence of learners producing weak hypotheses h_1, h_2, h_3, h_4, which are combined into f(h_1, h_2, h_3, h_4).]
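The loop below sketches this procedure in the AdaBoost style, with labels in {-1, +1} and log-odds rule weights. The `weak_learn` interface is an assumed placeholder for any rough-rule inducer:

```python
import numpy as np

def boost(X, y, weak_learn, T=10):
    """AdaBoost-style loop: reweight training examples to focus on the hard
    cases, then combine the weak rules h_1..h_T by weighted vote.
    `weak_learn(X, y, w)` must return a classifier h with h(X) in {-1,+1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # importance weights of documents
    rules, alphas = [], []
    for _ in range(T):
        h = weak_learn(X, y, w)
        pred = h(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)       # up-weight misclassified examples
        w /= w.sum()
        rules.append(h); alphas.append(alpha)
    return lambda X_: np.sign(sum(a * h(X_) for a, h in zip(alphas, rules)))
```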
Boosting: Applied to Text Filtering

• Naive Bayes
  • Traditional algorithm for text filtering
• Boosting naive Bayes
  • Uses naive Bayes classifiers as weak learners [Kim & Zhang, SIGIR-2000]

The naive Bayes classifier assumes independence among terms:

c = argmax_{c_j ∈ {relevant, irrelevant}} P(c_j) P(d_i | c_j)
  = argmax_{c_j} P(c_j) Π_{k=1}^{n} P(w_ik | c_j)
  = argmax_{c_j} P(c_j) P(w_i1 = "our" | c_j) P(w_i2 = "approach" | c_j) ... P(w_in = "trouble" | c_j)
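A small sketch of this decision rule, computed in log space to avoid underflow. The probability tables, the smoothing floor for unseen words, and the example values are illustrative assumptions:

```python
import math

def nb_classify(doc_words, prior, cond):
    """argmax_c P(c) prod_k P(w_k|c), computed in log space.
    `prior[c]` and `cond[c][w]` are assumed to be estimated from (possibly
    boosting-weighted) training documents; unseen words get a small floor."""
    best, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c]) + sum(
            math.log(cond[c].get(w, 1e-6)) for w in doc_words)
        if score > best_score:
            best, best_score = c, score
    return best

label = nb_classify(["our", "approach", "trouble"],
                    prior={"relevant": 0.3, "irrelevant": 0.7},
                    cond={"relevant": {"our": 0.01, "approach": 0.02},
                          "irrelevant": {"our": 0.008}})
```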
Boosting: Applied to Text Filtering - Experimental Setup

• TREC (Text Retrieval Conference)
  • Sponsored by NIST
• TREC-7 filtering datasets
  • Training documents: AP articles (1988), 237 MB, 79919 documents
  • Test documents: AP articles (1989-1990), 471 MB, 162999 documents
  • No. of topics: 50
• TREC-8 filtering datasets
  • Training documents: Financial Times (1991-1992), 167 MB, 64139 documents
  • Test documents: Financial Times (1993-1994), 382 MB, 140651 documents
  • No. of topics: 50

[An example document is shown on the original slide.]
Boosting: Applied to Text Filtering - Experimental Results

Compared with state-of-the-art text filtering systems:

TREC-7:
                      Boosting   ATT     NTT     PIRC
Averaged Scaled F1    0.474      0.461   0.452   0.500
Averaged Scaled F3    0.467      0.460   0.505   0.509

TREC-8:
                      Boosting   PLT1    PLT2    CL      Mer     PIRC
Averaged Scaled LF1   0.717      0.712   0.713   n/a     n/a     0.714
Averaged Scaled LF2   0.722      n/a     n/a     0.721   0.720   0.734
Evolutionary Learning: Applications in IR - Web-Document Retrieval

[Kim & Zhang, 2000]

• Each chromosome encodes a weight vector (w_1, w_2, w_3, ..., w_n) over link information and HTML tag information (<TITLE>, <H>, <B>, ..., <A>).
• Chromosomes are evaluated by retrieval performance, which serves as the fitness.

[Diagram: a population of weight-vector chromosomes is applied to retrieval; the retrieval results determine each chromosome's fitness.]
Evolutionary Learning: Applications in IR - Tag Weighting

• Truncation selection
• Crossover: offspring chromosome Z is formed from parent chromosomes X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) with
  z_i = (x_i + y_i) / 2   with probability P_c
• Mutation: chromosome X becomes X' by changing each value with probability P_m
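A sketch of the three operators. The averaging crossover follows the z_i = (x_i + y_i)/2 rule above; the Gaussian perturbation used for mutation is an assumption, since the slide only says the value is changed with probability P_m:

```python
import numpy as np
rng = np.random.default_rng(0)

def crossover(x, y, p_c=0.7):
    """Offspring z: z_i = (x_i + y_i)/2 with probability P_c,
    otherwise z_i is copied from parent X."""
    mask = rng.random(len(x)) < p_c
    return np.where(mask, (x + y) / 2.0, x)

def mutate(x, p_m=0.01, scale=0.1):
    """Change each tag weight with probability P_m (Gaussian perturbation
    here, as an illustrative choice)."""
    mask = rng.random(len(x)) < p_m
    return np.where(mask, x + rng.normal(scale=scale, size=len(x)), x)

def truncation_select(pop, fitness, frac=0.5):
    """Truncation selection: keep the best `frac` of chromosomes as parents."""
    order = np.argsort(fitness)[::-1]
    return pop[order[: max(1, int(len(pop) * frac))]]
```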
Evolutionary Learning: Applications in IR - Experimental Results

• Datasets
  • TREC-8 Web Track Data (WT2g): 2 GB, 247491 web documents
  • No. of training topics: 10; No. of test topics: 10
• Results
  • [Results charts not reproduced.]
Reinforcement Learning: Basic Concept

[Diagram: the agent-environment interaction loop]
1. The agent observes state s_t and reward r_t from the environment.
2. The agent takes action a_t.
3. The environment returns reward r_{t+1}.
4. The environment moves to state s_{t+1}.
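The same loop in code, with `env` and `agent` as assumed interfaces (the slide describes the loop abstractly, not an API):

```python
def run_episode(env, agent, max_steps=100):
    """Generic agent-environment loop matching the diagram: the agent sees
    state s_t and reward r_t, emits action a_t, and the environment returns
    r_{t+1} and s_{t+1}."""
    state, reward = env.reset(), 0.0
    for _ in range(max_steps):
        action = agent.act(state, reward)        # 2. action a_t
        state, reward, done = env.step(action)   # 3-4. reward r_{t+1}, state s_{t+1}
        if done:
            break
```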
Reinforcement Learning: Applications in IR - Information Filtering

[Seo & Zhang, 2000]

[Diagram: the WAIR agent interacts with the user]
1. State_i: the current user profile.
2. Action_i: modify the profile, then retrieve documents and calculate similarity.
3. Reward_{i+1}: relevance feedback from the user on the filtered documents.
4. State_{i+1}: the updated user profile.
Reinforcement Learning: Experimental Results (Explicit Feedback)

[Performance chart (y-axis in %) not reproduced.]
Reinforcement Learning: Experimental Results (Implicit Feedback)

[Performance chart (y-axis in %) not reproduced.]