1
Classification of CNN.com Articles using a TF*IDF Metric
Marie Vans and Steven Simske
HP Labs; Fort Collins, Colorado
April 20, 2016
2
Agenda
• TF*IDF Family of Metrics
• Word Frequencies
• Data Set & Preprocessing
• Algorithms for word frequencies and classification
• An example
• Results
• Future Directions
• Conclusions
3
TF*IDF – Family – Term Frequency
TF variants: 1 Power, 2 Mean, 3 NormLog, 4 Log, 5 NormLogs, 6 NormMean, 7 NormPower, 8 NormPowers
[TF equations shown as images on the original slide]
4
TF*IDF – Family – Inverse Document Frequency
IDF variants: 1 NormLogsOfSums (piecewise: one form if LogRatio ≥ MinLogRatio, another if LogRatio < MinLogRatio), 2 NormSumsOfLogs (same piecewise condition), 3 SumOfPowers, 4 PowerOfSums
[IDF equations shown as images on the original slide]
5
TF*IDF – Family – Inverse Document Frequency (continued)
IDF variants: 5 Mean, 6 NormSumOfLogs, 7 NormLogOfSums, 8 NormSumOfPowers, 9 NormSumsOfPowers, 10 SumOfLogs, 11 LogOfSums, 12 NormMean
[IDF equations shown as images on the original slide]
6
TF*IDF – Family – Inverse Document Frequency (continued)
IDF variants: 13 NormPowerOfSums, 14 NormPowersOfSums

Notation:
• i = current word
• j = current document
• k = total words in document j
• n = total words in documents other than the current document
• N = total number of documents in the corpus
• w(i,j) = number of occurrences of word i in document j
• w(i,n) = occurrences of word i in other documents
• n(i) = number of documents in which word i occurs
• LogRatio = ratio of the log for an individual word to the log of the document length
• MinLogRatio = user-settable minimum for LogRatio
• WordPower & DocPower = adjustable values
7
TF*IDF – Family – Putting it together
Each of the 8 TF variants is paired with each of the 14 IDF variants:
• TF_Power*IDF_NormLogsOfSums (with the LogRatio ≥ MinLogRatio and LogRatio < MinLogRatio branches)
• TF_Power*IDF_NormSumsOfLogs
• ...
• TF_NormPowers*IDF_NormPowersOfSums
112 TF*IDF Equations (8 TF × 14 IDF)
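The cross-product above can be made concrete with a minimal sketch. The variant names come from the slides; the pairing code itself is illustrative, not the authors' implementation:

```python
# Enumerate the TF*IDF family: 8 TF variants x 14 IDF variants = 112 metrics.
from itertools import product

tf_variants = ["Power", "Mean", "NormLog", "Log",
               "NormLogs", "NormMean", "NormPower", "NormPowers"]
idf_variants = ["NormLogsOfSums", "NormSumsOfLogs", "SumOfPowers", "PowerOfSums",
                "Mean", "NormSumOfLogs", "NormLogOfSums", "NormSumOfPowers",
                "NormSumsOfPowers", "SumOfLogs", "LogOfSums", "NormMean",
                "NormPowerOfSums", "NormPowersOfSums"]

metrics = [f"TF_{tf}*IDF_{idf}" for tf, idf in product(tf_variants, idf_variants)]
print(len(metrics))  # 112
print(metrics[0])    # TF_Power*IDF_NormLogsOfSums
```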
8
Word Frequencies

f(i, j) = the frequency of word i in file j

F(i) = Σ_{j=1}^{nfiles} f(i, j) = the frequency of word i across all nfiles
9
CNN Data Set

Class Name   Total Files   Training Set Files   Test Set Files
Business         161             81                  80
Health           290            145                 145
Justice          224            112                 112
Living            98             49                  49
Opinion          192             96                  96
Politics         195             98                  97
Showbiz          241            121                 120
Sport            148             74                  74
Tech             132             66                  66
Travel           171             86                  85
US               160             80                  80
World            988            494                 494

• 12 classes
• 3,000 total files
• Each class split into 2 sets: a training set and a test set
• File classes ground-truthed by CNN
Rafael Dueire Lins, Steven J. Simske, Luciano de Souza Cabral, Gabriel de Silva, Rinaldo Lima, Rafael F. Mello, and Luciano Favaro. A multi-tool scheme for summarizing textual documents. In Proceedings of the 11th IADIS International Conference WWW/INTERNET 2012, pages 1–8, July 2012.
10
CNN Data Set

Class Name   Training Set Unique Words   Test Set Unique Words   Total Words Processed
Business              8278                      7851                   16129
Health               12246                     12036                   24282
Justice               9133                      9032                   18165
Living                7936                      7030                   14966
Opinion              11382                     10886                   22268
Politics              9268                      9039                   18307
Showbiz               8997                      9949                   18946
Sport                 7445                      7191                   14636
Tech                  7971                      7548                   15519
Travel               14931                     12612                   27543
US                    8488                      8707                   17195
World                22936                     23441                   46377

• 12 classes
• Total words: 254,333
• Training set: 129,011
• Test set: 125,322
11
Preprocessing
• Remove "stop words"
• Remove punctuation (hyphenation excepted)
• No lemmatization
• SharpNLP – open-source natural language processing (https://sharpnlp.codeplex.com/):
  • sentence splitter
  • tokenizer
  • part-of-speech tagger
  • chunker
  • parser
  • name finder
  • coreference tool
  • interface to the WordNet lexical database
• File parsed with each word tagged with its part of speech
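A minimal sketch of the preprocessing steps above. The deck itself uses SharpNLP; the stop-word list and regex tokenizer here are stand-ins, chosen to keep internal hyphens and skip lemmatization as described:

```python
# Illustrative preprocessing: lowercase, tokenize keeping hyphenated words
# intact, drop stop words. No lemmatization is applied.
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in",
              "it", "its", "are", "over"}  # stand-in list, not the authors'

def preprocess(text):
    # Keep letters/digits and internal hyphens; other punctuation is dropped.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Concerns are mounting over the clean-energy drive."))
# ['concerns', 'mounting', 'clean-energy', 'drive']
```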
12
Program Classes (Not CNN Classes)
• Word class
  • m_Spelling
  • m_Count (frequency of the word in the file)
  • m_Weight (assigned by the different TF*IDF measures)
  • m_HasHyphen (hyphenated words count as a single word)
  • m_PennTags (part-of-speech tags)
  • m_Tags (number of tags associated with the word)
• TermFrequencies class
  • m_TermName
  • m_TermFreq
• Classify class
  • m_businessWords, m_healthWords, m_justiceWords, m_livingWords, m_opinionWords, m_politicsWords, m_showbizWords, m_sportWords, m_techWords, m_travelWords, m_usWords, m_worldWords
  • m_confusionMatrx
13
Algorithm
A. Using the Training Set files in each class (i.e., do this 12 times):
1.0 For each file in the set: create a word object for every unique word in the file
2.0 Count the total number of occurrences of each unique word over the entire set of documents
3.0 Calculate the weight of each word:
    weight(word_i) = total occurrences of word_i in all files / total occurrences of all words in all files
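Steps 1.0–3.0 above can be sketched as a short function. The data layout (each class as a list of token lists, one per file) is an assumption for illustration:

```python
# Training step A: per-class word weights as share of all word occurrences.
from collections import Counter

def train_class_weights(files_tokens):
    """1.0/2.0: count occurrences of each unique word over the whole set;
    3.0: weight = word occurrences / total occurrences of all words."""
    counts = Counter()
    for tokens in files_tokens:
        counts.update(tokens)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

weights = train_class_weights([["nuclear", "power", "nuclear"], ["power"]])
print(weights)  # {'nuclear': 0.5, 'power': 0.5}
```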
14
Algorithm
B. Using the Test Set files in a specific class (e.g., Business):
1.0 For each file in the set: create a word object for every unique word in the file
2.0 Count the total number of occurrences of each unique word over the entire set of documents
3.0 Calculate the weight of each word:
    per-file weight = total occurrences of word_i in the file / total occurrences of all words in the file
    per-class weight = total occurrences of word_i in all files / total occurrences of all words in all files

T(word_i) = Σ_{j=1}^{nfiles} f(i, j) / Σ_{j=1}^{nfiles} Σ_k f(k, j)
15
Algorithm
D. Classify each word_i in one test file by comparing it to the same word in all training classes (e.g., in the Business test class):

C_business = Test_word_i × BusinessTrain_word_i
C_health   = Test_word_i × HealthTrain_word_i
C_justice  = Test_word_i × JusticeTrain_word_i
...
C_world    = Test_word_i × WorldTrain_word_i

Class = Max{C_business, C_health, ..., C_world}
16
Algorithm
C. Classify each word_i in the entire test class by comparing it to the same word in all training classes (e.g., in the Business test class):

C_business = Test_word_i × BusinessTrain_word_i
C_health   = Test_word_i × HealthTrain_word_i
C_justice  = Test_word_i × JusticeTrain_word_i
...
C_world    = Test_word_i × WorldTrain_word_i

Class = Max{C_business, C_health, ..., C_world}
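The per-class scoring in steps C/D above can be sketched as follows. The function and the toy weights are illustrative, not the authors' code; summing the per-word products over a file (or class) and taking the maximum picks the winning class:

```python
# Classification: multiply each test-word weight by the matching
# training-class weight, sum the products, and take the max class.
def classify(test_weights, class_train_weights):
    scores = {}
    for cls, train_weights in class_train_weights.items():
        # C_cls = sum over words of Test_word_i * ClsTrain_word_i
        scores[cls] = sum(w * train_weights.get(word, 0.0)
                          for word, w in test_weights.items())
    return max(scores, key=scores.get)  # Class = Max{C_business, ..., C_world}

test = {"nuclear": 0.02, "germany": 0.01}          # toy test-file weights
train = {"business": {"nuclear": 0.001, "germany": 0.0009},
         "health":   {"nuclear": 0.0002}}           # toy training weights
print(classify(test, train))  # business
```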
17
CNN Data Set – Example Article – Business Class

After Fukushima: Could Germany's nuclear gamble backfire?

As Germany's switchover from nuclear power to renewable energy gathers pace, concerns are mounting over the cost to the country's prosperity and its already squeezed consumers.

Politicians in Europe's largest economy want renewable power to contribute 35% of the country's electricity consumption by 2020 and 80% by 2050 as part of its clean energy drive.

The country's 'energiewende' -- translated as energy transformation -- is part of the government's plan to move away from nuclear power and fossil fuels to renewable energy sources, following Japan's Fukushima disaster in 2011.

Michael Limburg, vice-president of the European Institute for Climate and Energy, told CNN that the government's energy targets are 'completely unfeasible.'

'Of course, it's possible to erect tens of thousands of windmills but only at an extreme cost and waste of natural space,' he said. 'And still it would not be able to deliver electricity when it is needed.'

The government is investing heavily in onshore and offshore wind farms and solar technology in an effort to reduce 40% of greenhouse gas emissions by 2020.

Last year Chancellor Angela Merkel, who this week won her third term as Germany's leader, proposed to construct offshore wind farms in the North Sea, a plan that would cost 200 billion euros ($270 billion), according to the DIW economic institute in Berlin.

As part of the energy drive, Merkel also pledged to permanently shut down the country's 17 nuclear reactors, which fuel 18% of the country's power needs. Under Germany's Atomic Energy Act, the last nuclear power plant will be disconnected by 2022.
18
CNN Data Set – Example Frequencies – Training Set

Single file (m_TermName: m_TermFreq):
  "fukushima": 6
  "germany": 1
  "nuclear": 12

All files in class (m_TermName: m_TermFreq):
  "fukushima": 9
  "germany": 26
  "nuclear": 33

% occurrence in file:
  fukushima 0.0102739726027397
  germany 0.00156739811912226
  nuclear 0.0205479452054795

% occurrence in class:
  fukushima 0.000307188203972967
  germany 0.000887432589255239
  nuclear 0.00112635674790088
19
CNN Data Set – Example Frequencies – Test Set

Single file (m_TermName: m_TermFreq):
  "fukushima": 2
  "germany": 9
  "nuclear": 5

All files in class (m_TermName: m_TermFreq):
  "fukushima": 2
  "germany": 21
  "nuclear": 6

% occurrence in file:
  fukushima 0.0056022408963585
  germany 0.0252100840336134
  nuclear 0.0140056022408964

% occurrence in class:
  fukushima 0.000073773515308004
  germany 0.000774621910734047
  nuclear 0.000221320545924013
20
Classify All Words in a Single Business Test File

Business 0.00069364  ← MAX class value
Health   0.00030063
Justice  0.00025000
Living   0.00026707
Opinion  0.00033446
Politics 0.00034694
Showbiz  0.00025372
Sport    0.00029984
Tech     0.00033337
Travel   0.00023201
US       0.00031539
World    0.00040208
21
Classify All Words in All Business Test Files

Business 0.00059513  ← MAX class value
Health   0.00035854
Justice  0.00027830
Living   0.00038269
Opinion  0.00039295
Politics 0.00036828
Showbiz  0.00029162
Sport    0.00036698
Tech     0.00040147
Travel   0.00032406
US       0.00032592
World    0.00037747
22
Confusion Matrix
• Each column contains samples of classifier output
• Each row contains samples in the true class
• Each row sums to 1.0
• The diagonal shows the percent classified correctly
  • Mean of diagonal = 89%
• Off-diagonal entries show the types of errors that occur
  • A is misclassified as B – 3%
  • A is misclassified as C – 3%

Normalized Confusion Matrix – Classifier Output (Computed Classification)

True Class of the       Prediction
Samples (Input)       A      B      C
        A           0.94   0.03   0.03
        B           0.08   0.85   0.07
        C           0.08   0.04   0.88
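The row normalization and diagonal mean described above can be checked with a few lines. The raw counts here are invented to reproduce the A/B/C proportions shown on the slide:

```python
# Normalize a confusion matrix by row, then take the mean of the diagonal.
def normalize_rows(matrix):
    return [[v / sum(row) for v in row] for row in matrix]

counts = [[94, 3, 3],    # true A: 94 right, 3 -> B, 3 -> C (toy counts)
          [8, 85, 7],    # true B
          [8, 4, 88]]    # true C

norm = normalize_rows(counts)                       # each row sums to 1.0
diag_mean = sum(norm[i][i] for i in range(3)) / 3   # percent correct, averaged
print(round(diag_mean, 2))  # 0.89
```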
23
Results - Classification

(rows = true class; columns = predicted class)

           business health  justice living  opinion politics showbiz sport   tech    travel  us      world
business   0.75     0       0       0.0875  0       0        0       0.025   0.1125  0.025   0       0
health     0        0.7724  0.0207  0.1793  0       0.0138   0.0069  0       0.0069  0       0       0
justice    0        0       0.9018  0.0179  0       0.0446   0.0089  0       0       0       0.0268  0
living     0.0204   0.0408  0       0.8163  0       0.0204   0       0.0612  0.0408  0       0       0
opinion    0.2083   0.0729  0.0208  0.2708  0.0313  0.1667   0       0.0417  0.0417  0       0.0104  0.1354
politics   0.0103   0.0103  0.0515  0.0412  0       0.8557   0       0       0       0       0       0.0309
showbiz    0.0083   0.0083  0.1583  0.1417  0       0.0083   0.6417  0.025   0       0       0.0083  0
sport      0.027    0.0135  0.0405  0.0541  0       0.027    0       0.8108  0       0.027   0       0
tech       0.0303   0.0303  0.0152  0.2121  0       0        0.0152  0       0.6818  0.0152  0       0
travel     0.1412   0.0118  0.0235  0.1412  0       0.0824   0       0.0353  0.0588  0.4353  0.0471  0.0235
us         0.025    0.05    0.3125  0.175   0       0.1125   0.025   0.0625  0.0375  0.0125  0.1875  0
world      0.0769   0.0142  0.1316  0.0789  0.002   0.1255   0.0061  0.0142  0.0202  0.002   0.0142  0.5142
Note that the diagonal entries (bold on the original slide) are the correct classifications. The rows sum to 1.0, since each row represents the actual class from which the documents are taken. The column sums average 1.0 across classes, with some variance depending on whether the class in the column is an attractor class (sum > 1.0) or a repulsor class (sum < 1.0).
24
Example of an Incorrectly Classified File

Results for a file from the Opinion test class:

Business 0.00033924
Health   0.00025056
Justice  0.00027728
Living   0.00027807
Opinion  0.00041936  ← 3rd MAX class value
Politics 0.00046704  ← MAX class value
Showbiz  0.00023136
Sport    0.00028422
Tech     0.00025991
Travel   0.00021793
US       0.00032973
World    0.00043251  ← 2nd MAX class value

It takes 3 tries to get it right.
25
Classification Attempts to Success
Measures the average number of attempts until the correct class is chosen:

Attempts = ( Σ_{k=1}^{12} k × P_k ) / nfiles

where
P_1 = number of files correctly classified on the first try
P_2 = number correctly classified after two tries
...
P_12 = number correctly classified on the last try
nfiles = number of files in the testing class

Example: worst class – Opinion
Σ = 1×3 + 2×35 + 3×31 + 4×14 + 5×7 + 6×3 + 7×2 + 8×1 = 297
297 / 96 = 3.09
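The Opinion example above can be reproduced with a small helper (illustrative code, not the authors'):

```python
# Attempts-to-success: mean of (try number x files correct on that try),
# divided by the number of files in the test class.
def attempts_to_success(p, nfiles):
    # p[k-1] = number of files correctly classified on the k-th try
    return sum(k * pk for k, pk in enumerate(p, start=1)) / nfiles

opinion = [3, 35, 31, 14, 7, 3, 2, 1]  # P1..P8 from the slide
print(round(attempts_to_success(opinion, 96), 2))  # 3.09
```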
26
Results - Classification Attempts to Success
• Measures the average number of attempts until the correct class is chosen
• Ideal is 1.0 – we get it right on the first try
• Best class – Justice
  • Correctly classified: 0.9018
  • Mean classification attempts: 1.19
  • Delta from ideal = 0.19
• Worst class – Opinion
  • Correctly classified: 0.0313
  • Mean classification attempts: 3.09
  • Delta from ideal = 2.09
• The best class's delta from ideal is 11 times smaller than the worst class's
• All other classes fall between the best and the worst
27
Results - General
• Confusion matrix shows good classification results
  • Average classification rate for all classes = 0.61655883
• Classification errors:
  • Normalized by total occurrences of all words in the file, for classification of a single file
  • Normalized by total occurrences of all words in the class, for classification of multiple files
• Column sums of the confusion matrix:
  business 1.2978, health 1.0245, justice 1.6765, living 2.216, opinion 0.0333, politics 1.4569, showbiz 0.7037, sport 1.0757, tech 1.0003, travel 0.517, us 0.2943, world 0.704
• Attractor classes (column sum > 1.0): Business, Health, Justice, Living, Politics, Sport, Tech
• Repulsor classes (column sum < 1.0): Opinion, Showbiz, Travel, U.S., World
28
Discussion
Why do some classes (e.g., Opinion, World, U.S.) classify so much worse than others (e.g., Justice)?
1. Documents are more varied?
2. Generic words?
3. Topics overlap?
4. Word clusters are broader?
29
Future Directions
• Automatic summarization based on word frequencies in sentences
  • The data set from Brazil also contains Gold Standard sentences for summarization
  • Each file contains sentences pulled out of the full article by at least 3 students
  • The Gold Standard sentences for each file act as ground truth for automatic summarization
• New York Times Annotated Corpus (https://catalog.ldc.upenn.edu/LDC2008T19)
  • Written and published by the New York Times between January 1, 1987 and June 19, 2007
  • Metadata provided by the New York Times Newsroom, the New York Times Indexing Service, and the online production staff at nytimes.com:
    • Over 1.8 million articles
    • Over 650,000 article summaries written by library scientists
    • Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations, and topic descriptors
    • Over 275,000 algorithmically tagged articles that have been hand-verified by the online production staff at nytimes.com
30
Conclusions
• A family of TF*IDF metrics for summarization and classification
• A simple TF*IDF metric
• A classification scheme that works well on a set of 3,000 CNN articles separated into 12 classes
• Classification attempts to success is a measure that tells us how hard a class is to classify
• Attractor and repulsor classes may help identify imbalances in the data
• The simple TF*IDF metric can be used for benchmarking the rest of the 112 TF*IDF metrics
31
Thank You for Your Kind Attention