1
Open Source Text Mining
Text Mining 2003 @ SDM03, Cathedral Hill Hotel, San Francisco
Hinrich Schütze, Enkata
May 3, 2003
2
Motivation
Open source used to be a crackpot idea.
  Bill Gates on Linux (1999.03.24): “I really don't think in the commercial market, we'll see it in any significant way.”
  MS 10-Q quarterly filing (2003.01.31): “The popularization of the open source movement continues to pose a significant challenge to the company's business model.”
Open source is an enabler for radical new things
  Google: ultra-cheap web servers
  Free news, free email, free …
  Class projects
  Walmart PC for $200
3
GNU-Linux
4
Web Servers: Open Source Dominates
Source: Netcraft
5
Motivation (cont.)
Text mining has not had much impact: many small companies & small projects, no large-scale adoption. Exception: text-mining-enhanced search.
Text mining could transform the world: unstructured → structured.
Information explosion: the amount of information has exploded; the amount of accessible information has not.
Can open source text mining make this happen?
Can open source text mining make this happen?
6
Unstructured vs Structured Data
[Bar chart, 0–100% scale: unstructured data accounts for the bulk of the data volume, while structured data accounts for the bulk of the market cap. Source: Prabhakar Raghavan, Verity]
7
Business Motivation
High cost of deploying text mining solutions. How can we lower this cost?
100% proprietary solutions require re-invention of core infrastructure and leave fewer resources for high-value applications built on top of that infrastructure.
8
Definitions
Open source: public domain, BSD, GPL (GNU General Public License)
Text mining: like data mining, but for text
  A subdiscipline of NLP (Natural Language Processing)
  Has interesting applications now
  More than just information retrieval / keyword search
  Usually has some statistical, probabilistic or frequentist component
9
Text Mining vs. NLP (Natural Language Processing)
What is not text mining: speech, language models, parsing, machine translation
Typical text mining: clustering, information extraction, question answering
Statistical and high volume
10
Text Mining: History
80s: Electronic text gives birth to Statistical Natural Language Processing (StatNLP).
90s: DARPA sponsors Message Understanding Conferences (MUC) and Information Extraction (IE) community.
Mid-90s: Data Mining becomes a discipline and usurps much of IE and StatNLP as “text mining”.
11
Text Mining: Hearst’s Definition
Finding nuggets Information extraction Question answering
Finding patterns Clustering Knowledge discovery
Text visualization
12
Information Extraction
Example: an extracted job record
  foodscience.com-Job2
  JobTitle: Ice Cream Guru
  Employer: foodscience.com
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  ContactPhone: 800-488-2611
  DateExtracted: January 8, 2001
  Source: www.foodscience.com/jobs_midwest.html
  OtherCompanyJobs: foodscience.com-Job1
13
Knowledge Discovery: Arrowsmith
Goal: connect two disconnected subfields of medicine.
Technique
  Start with the 1st subfield
  Identify its key concepts
  Search for a 2nd subfield with the same concepts
Implemented in the Arrowsmith system
Discovery: magnesium is a potential treatment for migraine
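The technique above can be sketched in a few lines. The representation (documents as sets of concept strings) and all concept and subfield names below are illustrative, not Arrowsmith's actual data:

```python
from collections import Counter

# Sketch of the Arrowsmith-style discovery procedure. The document
# representation (sets of concept strings) and all concept/subfield
# names are made up for illustration.
def abc_candidates(subfield_a_docs, concept_index, min_count=2):
    """Link subfield A to other subfields via shared key concepts.

    subfield_a_docs: list of concept sets, one per document in subfield A.
    concept_index: dict mapping concept -> set of other subfield names
                   in which that concept occurs.
    Returns dict: candidate subfield -> set of shared key concepts.
    """
    # Step 1: identify key concepts of the first subfield.
    counts = Counter(c for doc in subfield_a_docs for c in doc)
    key_concepts = {c for c, n in counts.items() if n >= min_count}

    # Step 2: search for second subfields sharing those concepts.
    links = {}
    for concept in key_concepts:
        for subfield in concept_index.get(concept, ()):
            links.setdefault(subfield, set()).add(concept)
    return links

docs_migraine = [{"serotonin", "spreading depression"},
                 {"spreading depression", "vascular tone"},
                 {"serotonin", "platelet aggregation"}]
index = {"spreading depression": {"magnesium literature"},
         "vascular tone": {"magnesium literature"}}
print(abc_candidates(docs_migraine, index))
```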
14
Knowledge Discovery: Arrowsmith
15
When is Open Source Successful?
“Important” problem
  Many users (operating systems)
  Fun to work on (games)
  Public funding available (OpenBSD, security)
  Open source authors gain fame/satisfaction/immortality/community
Adaptation
  A little adaptation is easy
  Most users do not need any adaptation (out-of-the-box use)
Incremental releases are useful
Cost sharing without administrative/legal overhead
  Dozens of companies have a significant interest in Linux (IBM, …)
  Many of these companies contribute to open source
  This is in effect an informal consortium; a formal effort probably would have killed Linux. Does the same apply to text mining?
Also: bugs, security, high availability; ideal for consulting & hardware companies like IBM
16
When is Open Source Not Successful?
Boring & rare problems: print driver for a 10-year-old printer
Complex integrated solutions: QuarkXPress, ERP systems
Good UI experience for non-geeks: Apple, Microsoft Windows (at least for now)
17
Text Mining and Open Source
Pro
  Important problem: fame, satisfaction, immortality, community can be gained
  Pooling of resources / critical mass
Con
  Non-incremental?
  Most text mining requires significant adaptation.
  Most text mining requires data resources as well as source code.
  The need for data resources does not fit well into the open source paradigm.
18
Text Mining Open Source Today
Lucene: excellent for information retrieval, but not much text mining
Rainbow, Weka, GTP, TDMAPI: text mining algorithms / infrastructure, no data resources
NLTK: NLP toolkit, some data resources
WordNet, DMOZ: excellent data resources, but not enough breadth/depth
19
Open Source with Open Data
Spell checkers (e.g., Emacs)
Antispam software (e.g., SpamAssassin)
Named entity recognition (GATE/ANNIE)
Free versions less powerful than in-house ones
20
SpamAssassin: Code + Data
21
Open Data Resources: Examples
SpamAssassin: classification model for spam
Named entity recognition: word lists, dictionaries
Information extraction: domain models, taxonomies, regular expressions
Shallow parsing: grammars
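A concrete picture of this code/data split for named entity recognition: the matcher below is generic code, while the gazetteer (word list) is the swappable data resource that would be shared and recycled. The entries are toy examples:

```python
# The code is generic; the gazetteer is the open data resource.
# Swapping in a different word list retargets the recognizer without
# touching the code. The entries here are toy examples.
GAZETTEER = {"san francisco": "LOCATION", "enkata": "ORGANIZATION"}

def tag_entities(text, gazetteer=GAZETTEER, max_len=3):
    """Tag gazetteer phrases in text, preferring longer matches."""
    tokens = text.lower().split()
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase first so "san francisco" beats "san".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in gazetteer:
                entities.append((phrase, gazetteer[phrase]))
                i += n
                break
        else:
            i += 1
    return entities

print(tag_entities("Talk given in San Francisco for Enkata"))
```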
22
Code vs Data
[Quadrant diagram; axes: code (proprietary vs. open source) and data (no resources needed vs. significant resources needed).
  Open source, no resources needed: Linux, web servers
  Open source, significant resources needed: spam filtering, spell checkers
  Proprietary, no resources needed: complex & integrated SW, good UI design
  Proprietary, significant resources needed (marked “?”): text classification, named entity recognition, information extraction]
23
Open Source with Data: Key Issues
Can data resources be recycled?
  Problems have to be similar.
  More difficult than one would expect: my first attempt failed (Medline/Reuters). Next: a case study.
Assume there is a large library of data resources available.
  How do we identify the data resources that can be recycled?
  How do we adapt them?
How do we get from here to there?
  Need an incremental approach that is sustained by successes along the way.
24
Text Mining without Data Resources
Premise: “knowledge-poor” text mining taps only a small part of the potential of text mining.
Knowledge-poor text mining examples: clustering, phrase extraction, first story detection
Many success stories
25
Case Study: ODP → Reuters
Train on ODP, apply to Reuters
26
Case Study: Text Classification
Key issues for text classification
  Show that text classifiers can be recycled
  How can we select reusable classifiers for a particular task?
  How do we adapt them?
Case study
  Train classifiers on the Open Directory (ODP): 165,000 docs (nodes), crawled in 2000, 505 classes
  Apply the classifiers to Reuters RCV1: 780,000 docs, >1000 classes
Hypothesis: a library of classifiers based on ODP can be recycled for RCV1.
27
Experimental Setup
Train 505 classifiers on ODP
Apply them to Reuters
Compute χ² for all ODP × Reuters pairs
Evaluate the n pairs with the best χ²
Evaluation measures
  Area under the ROC curve: plot false positive rate vs. true positive rate, compute the area under the curve
  Average precision: rank documents, compute precision at each positive document's rank, average over all positive documents
  Estimated based on a 25% sample
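Both evaluation measures can be implemented in a few lines. This is a minimal sketch with made-up scores and labels, not the setup used in the experiments:

```python
# Minimal implementations of the two measures on the slide, applied
# to a list of (score, is_positive) pairs. The data is made up.
def roc_auc(scored):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) form."""
    pos = [s for s, y in scored if y]
    neg = [s for s, y in scored if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scored):
    """Rank documents, compute precision at each positive, average."""
    ranked = sorted(scored, key=lambda x: -x[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / rank
    return total / hits

data = [(0.9, True), (0.8, False), (0.7, True), (0.2, False)]
print(roc_auc(data), average_precision(data))
```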
28
Japan: ODP → Reuters
[ROC curve for the Japan classifier trained on ODP and applied to Reuters; x-axis: false positive rate (0.0–0.9), y-axis: true positive rate (0.0–1.0)]
29
Some Results
30
BusIndTraMar0 / I76300: Ports
31
Discussion
Promising results
These are results without any adaptation; performance is expected to be much better after adaptation.
32
Discussion (cont)
Class relationships are m:n, not 1:1.
  One Reuters class maps to many ODP classes. Reuters GSPO: SpoBasCol0, SpoBasMinLea0, SpoBasReg0, SpoHocIceLeaNatPla0, SpoHocIceLeaPro0
  One ODP class maps to many Reuters codes. ODP RegEurUniBusInd0 (UK industries): I13000 (petroleum & natural gas), I17000 (water supply), I32000 (mechanical engineering), I66100 (restaurants, cafes, fast food), I79020 (telecommunications), I9741105 (radio broadcasting)
33
Why Recycling Classifiers is Difficult
Autonomous vs. relative decisions: the ODP Japan classifier without modifications has high precision, but only 1% recall on RCV1!
Most classifiers are tuned for optimal performance in an embedded system; tuning decreases robustness in recycling.
Tokenization, document length, numbers
  Numbers throw off a Medline vs. non-Medline categorizer (financial documents classified as medical)
  Length-sensitive multinomial Naïve Bayes: nonsensical results
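The length-sensitivity problem can be seen in a toy multinomial Naive Bayes score: the unnormalized log-score scales with document length, so a decision threshold tuned on one corpus transfers badly to another with different document lengths. All probabilities here are made up:

```python
import math

# Toy multinomial Naive Bayes score with made-up parameters, showing
# why a decision threshold tuned on one corpus does not transfer:
# the unnormalized log-score scales with document length.
logp = {"oil": math.log(0.02), "price": math.log(0.01)}  # P(term | class)
prior = math.log(0.1)                                    # P(class)

def nb_score(tokens):
    # log P(class) + sum of log P(token | class); unknown tokens get
    # a small smoothed probability.
    return prior + sum(logp.get(t, math.log(1e-4)) for t in tokens)

short_doc = ["oil", "price"]
long_doc = short_doc * 50   # identical content, 50x the length
print(nb_score(short_doc))  # a modest negative log-score
print(nb_score(long_doc))   # far more negative, despite identical content
```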
34
Specifics
What would an open source text classification package look like?
Code
  Text mining algorithms
  Customization component: to adapt recycled data resources
  Creation component: to create new data resources
Data
  Recycled data resources
  Newly created data resources
Pick a good area: bioinformatics (genes/proteins), product catalogs
35
Other Text Mining Areas
Named entity recognition Information extraction Shallow parsing
36
Data vs Code
What about just sharing training sets? Often proprietary.
What about just sharing models? Small preprocessing changes can throw you off completely.
Share a (simple?) classifier cum preprocessor along with the models. Still proprietary issues.
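A minimal sketch of such a bundled artifact, assuming a simple linear model and a pickle-based distribution format; all names, weights, and the tokenizer are illustrative, not an existing package's API:

```python
import pickle
import re

def tokenize(text):
    # The exact tokenization the model was trained with must travel
    # with the model: changing this regex would shift every feature.
    return re.findall(r"[a-z0-9]+", text.lower())

class BundledClassifier:
    """Classifier cum preprocessor: weights plus the tokenizer contract."""

    def __init__(self, weights, bias=0.0):
        self.weights, self.bias = weights, bias

    def score(self, text):
        return self.bias + sum(self.weights.get(t, 0.0)
                               for t in tokenize(text))

clf = BundledClassifier({"merger": 1.5, "shares": 0.8}, bias=-1.0)
blob = pickle.dumps(clf)          # one shippable artifact: code + data
restored = pickle.loads(blob)
print(restored.score("Shares jump on merger talk"))
```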
37
Open Source & Data
[Diagram: a proprietary Code+Data V1.0 is sanitized and published as public Sanitized & Enhanced Code+Data; the community adapts it into Enhanced Code+Data, which flows back into a new proprietary release, Code+Data V1.1.]
38
Free Riders?
Open source is successful because it makes free riding hard (the viral nature of the GPL).
This is harder to achieve for some data resources: download the models, apply them to your data, retrain, and you own 100% of the result.
Less of a problem for dictionaries and grammars
39
Data Licenses
Open Directory license: http://rdf.dmoz.org/license.html (BSD flavor)
WordNet: http://www.cogsci.princeton.edu/~wn/license.shtml (copyright; no license to sell derivative works?)
Some criteria for derivative works: substantially similar (Seinfeld trivia), potential damage to future marketing of derivative works
40
Code vs Data Licenses
Some similarity
  If I open-source my code, then I will benefit from bug fixes & enhancements written by others.
  If I open-source my data resource, then my classification model may become more robust due to improvements made by others.
Some dissimilarity
  Code is very abstract: few issues with proprietary information creeping in.
  Text mining resources are not very abstract: there is a potential of sensitive information leaking out.
41
Areas in Need of Research
How to identify reusable text mining components
  The ODP/Reuters case study does not address this.
  Do we need a (small) labeled sample to be able to do this?
How to adapt reusable text mining components
  Active learning
  Interactive parameter tweaking?
  Combination of a recycled classifier and new training information
How to estimate performance
  Most estimation techniques require large labeled samples.
  The point is to avoid construction of a large labeled sample.
Create a viral license for data resources.
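The active-learning adaptation idea above can be sketched as uncertainty sampling: ask a human to label only the unlabeled documents the recycled classifier is least sure about. The scoring function here is a stand-in for a real classifier's output:

```python
# Uncertainty sampling sketch for adapting a recycled classifier:
# pick the documents whose scores sit closest to the decision
# boundary (taken here to be 0.5), i.e. the most informative ones.
def uncertainty_sample(unlabeled, score, batch=5):
    return sorted(unlabeled, key=lambda d: abs(score(d) - 0.5))[:batch]

docs = ["d%d" % i for i in range(10)]
fake_score = lambda d: int(d[1:]) / 10.0  # stand-in classifier output
print(uncertainty_sample(docs, fake_score, batch=3))
```

Each labeling round retrains the classifier on the newly labeled batch, so the recycled model adapts with far fewer labels than building a training set from scratch.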
42
Summary
Many interesting research issues
Need an institution/individual to take the lead
Need a motivated network of contributors: data resource contributors and source code contributors
Start with a small & simple project that proves the idea
If it works … text mining could become an enabler on a par with Linux.
43
More Slides
44
Columns: ODP classifier, Reuters category, area under ROC curve, average precision (pairs sorted by average precision)
RegAsiJap0 JAP 0.86 0.62
RegAsiPhi0 PHLNS 0.91 0.56
RegAsiIndSta0 INDIA 0.85 0.53
SpoSocPla0 CCAT 0.60 0.53
RegEurRus0 CCAT 0.58 0.51
RegEurRus0 RUSS 0.85 0.51
SpoSocPla0 GSPO 0.78 0.42
SpoBasReg0 GSPO 0.75 0.33
RegAsiIndSta0 MCAT 0.56 0.32
SpoBasPla1 GSPO 0.80 0.31
SpoBasCol0 GSPO 0.78 0.31
SpoBasCol1 GSPO 0.74 0.26
RegEurSlo0 SLVAK 0.86 0.25
SpoBasPla0 GSPO 0.77 0.24
RegEurRus0 MCAT 0.49 0.23
BusIndTraMar0 I76300 0.81 0.23
SpoHocIceLeaPro0 GSPO 0.71 0.20
SpoBasMinLea0 GSPO 0.71 0.20
RegMidLeb0 LEBAN 0.83 0.19
RecAvi0 I36400 0.74 0.18
RegSou0 BRAZ 0.84 0.18
RegAsiHonBus0 HKONG 0.66 0.18
SpoMotAut0 GSPO 0.67 0.18
SpoHocIceLeaNatPla0 GSPO 0.72 0.17
SocPol0 EEC 0.85 0.17
RegAsiIndSta0 M14 0.59 0.17
RegAsiChiPro0 CHINA 0.67 0.17
RecAvi0 I3640010 0.77 0.17
SpoFooAmeColNca1 GSPO 0.72 0.17
SocPol0 G15 0.86 0.16
RegEurBul0 BUL 0.72 0.15
RegAsiIndPro0 INDON 0.72 0.13
SpoSocPla0 UK 0.49 0.12
RegEurUkr0 UKRN 0.73 0.11
RegEurRus0 GPOL 0.48 0.11
RegEurPolVoi0 POL 0.67 0.11
RegAsiIndSta0 M141 0.61 0.10
SpoFooAmeNflPla0 GSPO 0.65 0.09
RegEurGerSta0 GFR 0.56 0.09
RegEurFra0 FRA 0.54 0.09
RegCar0 CUBA 0.76 0.09
RegEurUniBusInd0 C18 0.59 0.08
RegEurUniEngEss0 I66200 0.72 0.08
RegSou0 PERU 0.88 0.08
ComHar0 C22 0.61 0.08
RegMidTur0 TURK 0.69 0.08
RegAsiIndSta0 M13 0.56 0.08
RegEurUniBusInd0 C181 0.59 0.07
RegNorUniCalLocPxx0 LATV 0.64 0.07
RegEurRus0 GVIO 0.52 0.07
SpoSocPla0 ITALY 0.58 0.07
RegEurUniSco0 GSPO 0.54 0.07
RegEurNet0 NETH 0.65 0.07
RegEurRus0 GDIP 0.46 0.07
ArtMusStyCouBan0 GENT 0.52 0.07
RegEurRus0 BYELRS 0.92 0.06
BusIndTraMar0 C24 0.54 0.06
BusIndTraMar0 I74000 0.72 0.06
RegNorMexSta0 I76300 0.58 0.06
SpoHocIceLeaNatPla0 CANA 0.54 0.06
RegSou0 MRCSL 1.00 0.06
SocRelBud0 GREL 0.57 0.05
RegEurBel0 FRA 0.49 0.05
SpoSocPla0 FRA 0.50 0.05
RegEurUniBusInd0 I6540005 0.69 0.05
RegNorCanQueLoc0 FRA 0.46 0.05
RegEurGerSta0 GSPO 0.45 0.05
RegAsiIndSta0 M131 0.61 0.05
RegAsiPak0 SHAJH 0.76 0.05
SpoSocPla0 GFR 0.48 0.05
RegSou0 PARA 0.90 0.04
RegEurUniBusInd0 I9741109 0.59 0.04
RegSou0 BOL 0.90 0.04
RegEurRus0 UKRN 0.83 0.04
SpoSocPla0 SPAIN 0.61 0.04
NewOnlCnn0 BAH 0.56 0.04
ArtAniVoi0 I97100 0.70 0.03
RegEurRus0 NATO 0.75 0.03
RegEurRus0 GDEF 0.55 0.03
SpoSocPla0 MONAC 0.87 0.03
SciEarPal0 GSCI 0.42 0.03
RegEurRom0 ROM 0.57 0.03
RegAsiPhi0 I85000 0.66 0.03
SpoBasReg0 SPAIN 0.59 0.03
BusIndTraMar0 USSR 0.47 0.03
SpoSocPla0 NETH 0.54 0.03
SpoFooAmeNflPla0 CANA 0.48 0.03
RegEurRus0 AZERB 0.94 0.03
SciBioTaxTaxPlaMagMag0 ECU 0.54 0.03
RegNorUniCalLocPxx0 I41500 0.65 0.02
RegEurRus0 TADZK 0.95 0.02
RegEurUniBusInd0 I8150206 0.71 0.02
RegEurUniBusInd0 I81502 0.58 0.02
RegSou0 URU 0.88 0.02
RegEurUniBusInd0 I50300 0.74 0.02
RegEurUniBusInd0 I37100 0.79 0.02
RefFlaReg0 GUREP 0.69 0.02
SciBioTaxTaxPlaMagMag0 I0100144 0.58 0.02
NewOnlCnn0 GWEA 0.66 0.02
RegEurUniBusInd0 I85000 0.57 0.02
ArtCelMxx0 I97100 0.66 0.02
SpoMotAut0 SMARNO 0.88 0.02
RegEurUniBusInd0 I5020022 0.79 0.02
NewOnlCnn0 DOMR 0.55 0.02
ArtMusStyCouBan0 GPRO 0.45 0.02
RegEurUniEngEss0 I83954 0.66 0.02
SpoBasReg0 GREECE 0.51 0.02
RegEurRus0 GRGIA 0.84 0.02
RegEurRus0 KAZK 0.82 0.02
RegEurNet0 M142 0.45 0.02
RegEurUniBusInd0 I83200 0.67 0.01
NewOnlCnn0 BELZ 0.50 0.01
RegEurUniBusInd0 C34 0.49 0.01
RegEurUniEngEss0 I82002 0.56 0.01
SpoBasReg0 ISRAEL 0.38 0.01
RegEurUniBusInd0 I83400 0.73 0.01
RegEurUniBusInd0 I83954 0.67 0.01
RegEurPolVoi0 FIN 0.58 0.01
RegEurRus0 USSR 0.82 0.01
RegEurUniBusInd0 I9741105 0.58 0.01
RegEurUniBusInd0 I32852 0.80 0.01
RegEurUniBusInd0 I83940 0.63 0.01
BusIndTraMar0 BUL 0.37 0.01
RegEurUniBusInd0 I61000 0.68 0.01
BusIndTraMar0 ESTNIA 0.60 0.01
NewOnlCnn0 GABON 0.46 0.01
NewOnlCnn0 CVI 0.70 0.01
SciBioTaxTaxAniChoAve0 GENV 0.45 0.01
SpoMotAut0 MONAC 0.71 0.01
ArtCelBxx0 I97100 0.64 0.01
SpoBasReg0 TURK 0.46 0.01
BusIndTraMar0 PORL 0.57 0.01
SpoBasReg0 CRTIA 0.48 0.01
RegEurUniBusInd0 I95100 0.65 0.01
BusIndTraMar0 CRTIA 0.41 0.01
BusIndTraMar0 UKRN 0.43 0.01
ArtCelLxx0 I97100 0.60 0.01
RegEurRus0 MOLDV 0.78 0.01
RegSou0 SURM 0.80 0.01
BusIndTraMar0 LATV 0.60 0.01
BusIndTraMar0 ALB 0.24 0.01
BusIndTraMar0 LITH 0.58 0.01
ArtCelSxx0 I97100 0.63 0.01
RegEurUniBusInd0 I16000 0.59 0.01
SpoBasCol0 E71 0.42 0.01
SciBioTaxTaxPlaMagMag0 BELZ 0.53 0.01
ArtMusStyCouBan0 GOBIT 0.53 0.01
BusFinBanBanReg0 C173 0.68 0.01
RegEurRus0 ARMEN 0.85 0.01
RegEurRus0 I22471 0.66 0.01
RegEurRus0 TURKM 0.86 0.01
BusIndTraMar0 ROM 0.40 0.01
BusIndTraMar0 TUNIS 0.67 0.00
RegAsiChiPro0 I5020006 0.76 0.00
ArtTelNet0 I9741105 0.67 0.00
BusIndTraMar0 YEMAR 0.49 0.00
BusIndTraMar0 CYPR 0.40 0.00
RefFlaReg0 SLVNIA 0.57 0.00
RegEurUniEngEss0 I9741105 0.57 0.00
RegEurRus0 KIRGH 0.83 0.00
RegCar0 GTOUR 0.55 0.00
BusIndTraMar0 UAE 0.48 0.00
NewOnlCnn0 BERM 0.52 0.00
BusIndTraMar0 NAMIB 0.48 0.00
BusIndTraMar0 JORDAN 0.36 0.00
RecAvi0 C313 0.42 0.00
BusIndTraMar0 MOZAM 0.51 0.00
RegEurUniBusInd0 I66200 0.66 0.00
BusIndTraMar0 SILEN 0.34 0.00
RegMidLeb0 I9741105 0.54 0.00
RegAsiHonBus0 I81400 0.61 0.00
RefFlaReg0 WORLD 0.43 0.00
RegNorUniCalLocVxx0 C313 0.39 0.00
RegAsiHonBus0 I64700 0.72 0.00
RefFlaReg0 UPVOLA 0.58 0.00
SciBioTaxTaxPlaMagMag0 I0100216 0.66 0.00
RegAsiHonBus0 I3640048 0.70 0.00
SciBioTaxTaxAniChoAve0 AARCT 0.53 0.00
RegSou0 I5020051 0.84 0.00
NewOnlCnn0 TCAI 0.00 0.00
45
Resources
http://www-csli.stanford.edu/~schuetze (this talk, some additional material)
Source of Gates quote: http://www.techweb.com/wire/story/TWB19990324S0014
Kurt D. Bollacker and Joydeep Ghosh. A scalable method for classifier knowledge reuse. In Proceedings of the 1997 International Conference on Neural Networks, pages 1474-1479, June 1997. (Proposes a measure for selecting classifiers for reuse.)
W. Cohen and D. Kudenko. Transferring and retraining learned information filters. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97). (Transfer within the same dataset.)
Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier architecture for scalable knowledge reuse. In Proceedings of the 1998 International Conference on Machine Learning, pages 64-72, July 1998. (Transfer within the same dataset.)
Motivation of open source contributors: http://newsforge.com/newsforge/03/04/19/2128256.shtml?tid=11, http://cybernaut.com/modules.php?op=modload&name=News&file=article&sid=8&mode=thread&order=0&thold=0