Catalog Integration
R. Agrawal, R. Srikant: WWW-10
Catalog Integration Problem
Integrate products from new catalog into master catalog.
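[Figure: the master catalog, with category ICs split into Logic, Mem., and DSP holding products a through f, beside the new catalog, with category ICs split into Cat 1 and Cat 2 holding products x, y, and z.]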
Desired Solution
Automatically integrate products:
– little or no effort on part of user.
– domain independent.
Problem size:
– Million products
– Thousands of categories
Basic Algorithm
Build classification model using product descriptions in master catalog.
Use classification model to predict categories for products in the new catalog.
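[Figure: classifier output for product x over the categories Logic and DSP, with predicted probabilities of 5% and 95%.]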
National Semiconductor Files
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Part_Id: DS14185
Manufacturer: national
Title: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Description: The DS14185 is a three driver, five receiver device which conforms to the EIA/TIA-232-E standard. The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.
Part: LM3940 1A Low Dropout Regulator
Part: Wide Adjustable Range PNP Voltage Regulator
Part: LM2940/LM2940C 1A Low Dropout Regulator
...
...
...
National Semiconductor Files with Categories
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Pangea Category:
Choice 1: Transceiver
Choice 2: Line Receiver
Choice 3: Line Driver
Choice 4: General-Purpose Silicon Rectifier
Choice 5: Tapped Delay Line
Part: LM3940 1A Low Dropout Regulator
Pangea Category:
Choice 1: Positive Fixed Voltage Regulator
Choice 2: Voltage-Feedback Operational Amplifier
Choice 3: Voltage Reference
Choice 4: Voltage-Mode SMPS Controller
Choice 5: Positive Adjustable Voltage Regulator
...
...
Accuracy on Pangea Data
B2B Portal for electronic components:
– 1200 categories, 40K training documents.
– 500 categories with < 5 documents.
Accuracy:
– 72% for top choice.
– 99.7% for top 5 choices.
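The "top choice" and "top 5 choices" figures are top-k accuracies over ranked category predictions. A minimal sketch of that measure follows; the function name `top_k_accuracy` and the toy data are hypothetical and are not taken from the Pangea catalog.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of items whose true label is among the first k ranked choices."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("inputs must have the same length")
    if not true_labels:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical toy data: each inner list is ordered from best to worst choice.
ranked = [
    ["Transceiver", "Line Receiver", "Line Driver"],
    ["Voltage Reference", "Positive Fixed Voltage Regulator", "Line Driver"],
]
truth = ["Transceiver", "Positive Fixed Voltage Regulator"]

top1 = top_k_accuracy(ranked, truth, 1)
top2 = top_k_accuracy(ranked, truth, 2)
```

In this toy case the second item is missed by the top choice but caught by the second one, which is the same effect that separates the two accuracy figures above.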
Enhanced Algorithm: Intuition
Use affinity information in the catalog to be integrated (new catalog):
– Products in same category are similar.
– Bias the classifier to incorporate this information.
Accuracy boost depends on quality of new catalog:
– Use tuning set to determine amount of bias.
Naive Bayes Classifier
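$\Pr(C_i \mid d) = \Pr(C_i)\,\Pr(d \mid C_i) \,/\, \Pr(d)$ (Bayes' rule)
– $\Pr(d)$: same for all categories (ignore)
– $\Pr(C_i) = \#\text{docs in } C_i \,/\, \#\text{total docs}$
$\Pr(d \mid C_i) = \prod_{w \in d} \Pr(w \mid C_i)$
– Words occur independently (unigram model)
$\Pr(w \mid C_i) = (n(C_i, w) + \lambda) \,/\, (n(C_i) + \lambda |V|)$
– Maximum likelihood estimate smoothed with Lidstone's law of succession
– $n(C_i, w)$: occurrences of word $w$ in $C_i$; $n(C_i)$: total words in $C_i$; $|V|$: vocabulary size; $\lambda$: smoothing parameter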
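The classifier above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the class name `NaiveBayes`, the default smoothing value `lam=1.0`, the choice to skip words outside the training vocabulary, and the toy documents are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Unigram naive Bayes with Lidstone smoothing (illustrative sketch)."""

    def __init__(self, lam=1.0):
        self.lam = lam  # smoothing parameter; the value 1.0 is an assumption

    def fit(self, docs, labels):
        # docs: list of token lists; labels: the category of each document.
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(docs)
        return self

    def log_prior(self, category):
        # Pr(Ci) = #docs in Ci / #total docs
        return math.log(self.doc_counts[category] / self.total_docs)

    def log_likelihood(self, tokens, category):
        # Pr(w|Ci) = (n(Ci, w) + lam) / (n(Ci) + lam * |V|)
        counts = self.word_counts[category]
        denom = sum(counts.values()) + self.lam * len(self.vocab)
        # Words never seen in training are skipped (an assumption).
        return sum(
            math.log((counts[w] + self.lam) / denom)
            for w in tokens
            if w in self.vocab
        )

    def predict(self, tokens):
        # Pr(d) is the same for every category, so it is ignored.
        return max(
            self.doc_counts,
            key=lambda c: self.log_prior(c) + self.log_likelihood(tokens, c),
        )


# Hypothetical toy master catalog.
docs = [
    ["low", "dropout", "regulator"],
    ["voltage", "regulator"],
    ["line", "driver"],
    ["line", "receiver", "driver"],
]
labels = ["Regulator", "Regulator", "Driver", "Driver"]
model = NaiveBayes().fit(docs, labels)
prediction = model.predict(["dropout", "regulator"])
```

Working in log space avoids underflow when the product over words gets long.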
Enhanced Algorithm
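$\Pr(C_i \mid d, S)$, where $d$ existed in category $S$ of the new catalog:

$$
\begin{aligned}
\Pr(C_i \mid d, S) &= \Pr(C_i, d, S) \,/\, \Pr(d, S) && \text{since } \Pr(C_i, d, S) = \Pr(d, S)\,\Pr(C_i \mid d, S) \\
&= \Pr(C_i)\,\Pr(S, d \mid C_i) \,/\, \Pr(S, d) \\
&= \Pr(C_i)\,\Pr(S \mid C_i)\,\Pr(d \mid C_i) \,/\, \Pr(S, d) && \text{assuming } d, S \text{ independent given } C_i \\
&= \Pr(S)\,\Pr(C_i \mid S)\,\Pr(d \mid C_i) \,/\, \Pr(S, d) && \text{since } \Pr(S \mid C_i)\,\Pr(C_i) = \Pr(C_i \mid S)\,\Pr(S) \\
&= \Pr(C_i \mid S)\,\Pr(d \mid C_i) \,/\, \Pr(d \mid S) && \text{since } \Pr(S, d) = \Pr(S)\,\Pr(d \mid S)
\end{aligned}
$$

Same as NB except $\Pr(C_i \mid S)$ instead of $\Pr(C_i)$.
– Ignore $\Pr(d \mid S)$ as it is the same for all classes.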
Computing Pr(Ci|S)
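$$
\Pr(C_i \mid S) = \frac{|C_i| \times (\#\text{docs in } S \text{ predicted to be in } C_i)^{w}}{\sum_{j \in [1,n]} |C_j| \times (\#\text{docs in } S \text{ predicted to be in } C_j)^{w}}
$$

$|C_i|$ = #docs in $C_i$ in the master catalog
$w$ determines weight of the new catalog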
– Use a tune set of documents in the new catalog for which the correct categorization in the master catalog is known
– Choose one weight for the entire new catalog or different weights for different sections
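The formula for Pr(Ci|S) can be sketched as a small function. The name `category_prior_given_s`, its argument names, and the toy counts are hypothetical. With w = 0 every exponentiated term equals 1, so the result falls back to the master-catalog prior |Ci| / Σ|Cj| used by the basic algorithm.

```python
def category_prior_given_s(master_sizes, predicted_counts, w):
    """Pr(Ci|S), proportional to |Ci| * (#docs in S predicted to be in Ci) ** w.

    master_sizes: {category: number of docs in the master catalog}
    predicted_counts: {category: number of docs in S that the basic
        classifier predicted to be in that category}
    w: weight given to the new catalog
    """
    scores = {
        category: size * (predicted_counts.get(category, 0) ** w)
        for category, size in master_sizes.items()
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no category received a non-zero score")
    return {category: score / total for category, score in scores.items()}


# Hypothetical example: 10 docs in new-catalog category S, of which the
# basic classifier put 1 in Logic and 9 in DSP.
master_sizes = {"Logic": 100, "DSP": 100}
predicted_counts = {"Logic": 1, "DSP": 9}

prior_w0 = category_prior_given_s(master_sizes, predicted_counts, 0)
prior_w1 = category_prior_given_s(master_sizes, predicted_counts, 1)
prior_w2 = category_prior_given_s(master_sizes, predicted_counts, 2)
```

Raising w sharpens the prior toward the category that most of S was predicted to be in, so a document in S that the text model narrowly places in Logic can be pulled over to DSP.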
Superiority of the Enhanced Algorithm
Theorem: The highest possible accuracy achievable with the enhanced algorithm is no worse than what can be achieved with the basic algorithm.
Catch: The optimum value of the weight for which the enhanced algorithm achieves its highest accuracy is data dependent.
– The tune set method attempts to select a good value for weight, but there is no guarantee of success.
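One way to picture the tune set method is as a search over candidate weights, keeping the one with the best accuracy on the tune set. The sketch below assumes a caller-supplied function that returns tune-set accuracy for a given weight; the name `choose_weight`, the tie-breaking rule, and the accuracy values are hypothetical. Including 0 among the candidates keeps the basic algorithm available as a fallback on the tune set.

```python
def choose_weight(candidate_weights, tune_accuracy):
    """Return (weight, accuracy) for the best-scoring candidate weight.

    tune_accuracy: callable mapping a weight to accuracy on the tune set.
    Ties go to the smaller weight (an assumption, not from the slides).
    """
    best_weight, best_accuracy = None, float("-inf")
    for weight in sorted(candidate_weights):
        acc = tune_accuracy(weight)
        if acc > best_accuracy:
            best_weight, best_accuracy = weight, acc
    return best_weight, best_accuracy


# Hypothetical tune-set accuracies, made up for illustration only.
toy_accuracy = {0: 0.60, 1: 0.64, 2: 0.66, 5: 0.70, 10: 0.70, 25: 0.68}
best = choose_weight(toy_accuracy, toy_accuracy.get)
```

Because the choice is made on a small sample, the selected weight may not be the best one for the whole new catalog, which is the "no guarantee" caveat above.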
Empirical Evaluation
Start with a real catalog M
Remove n products from M to form the new catalog N
In the new catalog N
– Assign f*n products to the same category as M
– Assign the rest to other categories as per some distribution (but remember their true category)
Accuracy: Fraction of products in N assigned to their true categories
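The evaluation setup can be sketched as follows. The function names, the uniform choice of a wrong category (the slides only say "some distribution"), and the reuse of master category labels as the new catalog's categories are all assumptions. The sketch needs at least two categories in the master catalog.

```python
import random


def make_new_catalog(master, n, f, seed=0):
    """Split `master` into a reduced master catalog and a synthetic new catalog.

    master: list of (product, true_category) pairs.
    Returns (remaining_master, new_catalog), where each new_catalog entry is
    (product, assigned_category, true_category).
    """
    rng = random.Random(seed)
    items = list(master)
    rng.shuffle(items)
    removed, remaining = items[:n], items[n:]
    categories = sorted({category for _, category in master})
    keep = round(f * n)  # number of products that keep their own category
    new_catalog = []
    for index, (product, true_category) in enumerate(removed):
        if index < keep:
            assigned = true_category
        else:
            # Uniform choice among the other categories (an assumption).
            others = [c for c in categories if c != true_category]
            assigned = rng.choice(others)
        new_catalog.append((product, assigned, true_category))
    return remaining, new_catalog


def accuracy(predicted_categories, new_catalog):
    """Fraction of products in N assigned to their true categories."""
    hits = sum(
        1
        for predicted, (_, _, true_category) in zip(predicted_categories, new_catalog)
        if predicted == true_category
    )
    return hits / len(new_catalog)


# Hypothetical toy master catalog with two categories.
master = [(f"p{i}", "Logic" if i % 2 == 0 else "DSP") for i in range(10)]
remaining, new_catalog = make_new_catalog(master, n=4, f=0.5)
kept = sum(1 for _, assigned, true in new_catalog if assigned == true)
```

Here `f` controls how clean the new catalog is: f = 1 reproduces the master categorization exactly, and smaller values scatter more products into the wrong category.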
Improvement in Accuracy (Pangea)
[Figure: Accuracy (y-axis, 65 to 100) against Weight (x-axis: 1, 2, 5, 10, 25, 50, 100, 200) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB, and Base.]
Improvement in Accuracy (Reuters)
[Figure: Accuracy (y-axis, 82 to 100) against Weight (x-axis: 1, 2, 5, 10, 25, 50, 100, 200) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB, and Base.]
Improvement in Accuracy (Google.Outdoors)
[Figure: Accuracy (y-axis, 50 to 100) against Weight (x-axis: 1, 5, 25, 100, 400, 1000) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB, and Base.]
Tune Set Size (Pangea)
[Figure: Accuracy (y-axis, 70 to 95) against Tune Set Size (x-axis: 0, 5, 10, 20, 35, 50) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB, and Base.]
Similar results for Reuters and Google.
Empirical Results
[Figure: % Errors (y-axis, 0 to 20) against Purity (No. of classes & their distribution) (x-axis: 71-22-6, 79-21, 100) for the series Standard and Enhanced.]