Scaling Text Mining to One Million Documents
Christophe Roeder, Karin Verspoor
Applying text mining to a large document collection demands more resources than a lab PC can provide. Preparing for such a task requires an understanding of the demands of the text-mining software and the capabilities of the supporting hardware and software. We describe efforts to scale a large text mining task.
Corpus Management:
- Arrange access from publishers
- Download files
- Parse XML of various DTDs to plain text (a parsing sketch follows this list)
- Parse PDF if XML is not available
- Find or maintain section zoning information
- Track source and citation information
- Keep up to date with periodic updates
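As an illustration of the XML-to-text step, here is a minimal sketch using the JDK DOM parser. The feature flag that skips fetching the external DTD and the decision to flatten the whole document to text are assumptions for illustration, not the converter used in this work.

```java
// Minimal sketch: flatten one article XML file to plain text with the JDK DOM parser.
// Skipping the external DTD fetch avoids slow network lookups for publisher DTDs.
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlToText {
    public static String extractText(File xmlFile) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Do not download the external DTD; only the character data is needed here.
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(xmlFile);
        // getTextContent() concatenates all character data under the root element.
        return doc.getDocumentElement().getTextContent();
    }
}
```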
Pipeline:
- Error reporting
- Identify and restart after memory leaks
- Identify parameters passed to analytics
- Progress tracking; restart from the last processed document (see the driver sketch after this list)
- Identify individual document errors and continue processing the others
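A minimal sketch of such a driver loop is shown below; the corpus directory and the process() hook are placeholders, not the authors' code. It shows how a completed-document log supports restart and how per-document failures are reported without stopping the run.

```java
// Sketch of a restartable driver: completed ids are appended to a log so a crashed
// run can resume, and a failure on one document does not stop the others.
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class RestartableDriver {
    public static void main(String[] args) throws IOException {
        Path doneLog = Paths.get("processed.log");
        Set<String> done = new HashSet<>();
        if (Files.exists(doneLog)) {
            done.addAll(Files.readAllLines(doneLog));          // resume from the last run
        }
        try (BufferedWriter log = Files.newBufferedWriter(doneLog,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (File doc : new File("corpus").listFiles()) {  // hypothetical input directory
                if (done.contains(doc.getName())) continue;    // already processed
                try {
                    process(doc);                              // placeholder for the pipeline
                    log.write(doc.getName());
                    log.newLine();
                    log.flush();
                } catch (Exception e) {
                    // report the per-document error and keep going
                    System.err.println("Failed on " + doc.getName() + ": " + e);
                }
            }
        }
    }

    static void process(File doc) { /* hypothetical hook for the UIMA pipeline */ }
}
```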
Analytics / Analysis Engines:
- Identify analytics and integrate them into UIMA (a skeleton annotator is sketched after this list)
- Check for possible concurrency issues
- Test for bugs and memory leaks
- Detailed error reporting
- Find memory and CPU requirements
- Track source, build, and modification information
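The UIMA integration point is an analysis engine class; the skeleton below shows its shape. The wrapped analytic and the annotation types it would add are left abstract, since they depend on the type system in use.

```java
// Skeleton of a UIMA analysis engine wrapping some analytic; the actual analytic
// and the annotation types it creates depend on the pipeline's type system.
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

public class ExampleAnnotator extends JCasAnnotator_ImplBase {
    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        String text = jcas.getDocumentText();
        // ... run the wrapped analytic over the text and add its annotations to the CAS ...
    }
}
```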
Scaling:
- UIMA CPE threads: simple and effective, but limited to one machine
- UIMA AS: put heavy engines on other machines
- Grid Engine: move files and run scripts across a cluster
- Hadoop (map/reduce): elegant Java interface
Integration:
- Store semantic information in a knowledge base for further processing
- Web application to manage and initiate job runs
- Allow a change in one analytic and a re-run of only the affected part of the pipeline
The Devil is in the Details
Scaling Framework Options

UIMA CPE:
- Basic UIMA pipeline engine
- Can run many threads
- Limited to one machine
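Starting a CPE from code looks roughly like the sketch below; the descriptor path is a placeholder, and the number of processing threads (processingUnitThreadCount) and the CAS pool size are configured in that descriptor rather than in code.

```java
// Rough sketch: run a Collection Processing Engine from its XML descriptor.
import org.apache.uima.UIMAFramework;
import org.apache.uima.collection.CollectionProcessingEngine;
import org.apache.uima.collection.metadata.CpeDescription;
import org.apache.uima.util.XMLInputSource;

public class RunCpe {
    public static void main(String[] args) throws Exception {
        CpeDescription desc = UIMAFramework.getXMLParser()
                .parseCpeDescription(new XMLInputSource("desc/MillionDocCPE.xml")); // placeholder path
        CollectionProcessingEngine cpe = UIMAFramework.produceCollectionProcessingEngine(desc);
        cpe.process();  // runs asynchronously; register a StatusCallbackListener for progress
    }
}
```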
UIMA AS (Asynchronous Scaleout):
- Uses message queues to link analytics on different machines
- Message queues allow flexibility in the timing of message delivery
- Useful for putting many instances of a heavy analytic on a separate machine
- Can be used to run many pipelines on many machines
- The overhead of XMI serialization is not trivial
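The client side of a UIMA AS deployment looks roughly like the sketch below, where the broker URL and the queue name of the remote service are placeholders; this follows the standard UIMA AS client API rather than any code from this work.

```java
// Rough sketch of a UIMA AS client: send a CAS through a message queue to a remote engine.
import java.util.HashMap;
import java.util.Map;
import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
import org.apache.uima.cas.CAS;

public class AsClientSketch {
    public static void main(String[] args) throws Exception {
        UimaAsynchronousEngine client = new BaseUIMAAsynchronousEngine_impl();
        Map<String, Object> ctx = new HashMap<>();
        ctx.put(UimaAsynchronousEngine.ServerUri, "tcp://broker:61616"); // placeholder broker URL
        ctx.put(UimaAsynchronousEngine.ENDPOINT, "parserQueue");         // placeholder queue name
        client.initialize(ctx);

        CAS cas = client.getCAS();
        cas.setDocumentText("Example sentence to be parsed remotely.");
        client.sendAndReceiveCAS(cas);   // blocks until the remote engine returns the CAS
        client.stop();
    }
}
```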
GridEngine:
- Cluster management software makes it easy to copy files to many machines at once
- Scripts can be started on many machines with one command
Hadoop:
- Map-reduce implementation: map distributes, reduce collates
- Related tools are very interesting: HDFS (the Hadoop Distributed File System)
- Behemoth: UIMA and GATE adapted to Hadoop
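On Hadoop, each map call can handle one document; the sketch below assumes (id, text) input records and a placeholder analyze() method, in the spirit of (but not copied from) Behemoth's UIMA adaptation.

```java
// Sketch of the map side: one input record per document (id, text); the mapper runs
// the analytic and emits (id, result), which the reduce step can then collate.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DocumentMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text docId, Text docText, Context context)
            throws IOException, InterruptedException {
        String result = analyze(docText.toString());  // run the analytic on one document
        context.write(docId, new Text(result));
    }

    private String analyze(String text) {
        return String.valueOf(text.length());         // placeholder for a real analytic
    }
}
```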
Resource Requirements of Selected Analytics
Name                 Time               Memory*
XSLT Converter       0.03 sec./doc.     < 256 MB
XML Parser           0.02 sec./doc.     < 256 MB
Sentence Detector    0.01 sec./doc.     < 256 MB
POS Tagger           2.6 sec./doc.      < 256 MB
Parser               1500 sec./doc.     > 1 GB
XMI Serialization    2.5 sec./doc. **   < 256 MB
Concept Mapper ***   n/a                > 2 GB
* Memory usage includes UIMA and the other analytics; 64-bit JVM.
** Annotations from sentence detection, tokenization, and POS tagging; time includes file I/O.
*** Time data not available; memory use reflects loading Swiss-Prot.
Memory requirements that differ by a factor of ten and run times that span five orders of magnitude suggest that a precise description of the pipeline is vital when specifying hardware.
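For example, at 2.6 sec./doc. the POS tagger needs roughly 2.6 million CPU-seconds, about 30 CPU-days, for one million documents, while at 1500 sec./doc. the parser would need roughly 1.5 billion CPU-seconds, close to 47 CPU-years on a single core, a workload that is practical only with substantial parallelism.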
Acknowledgements: NIH grant R01-LM010120-01 to Karin Verspoor and the SciKnowMine project funded by NSF grant #0849977 and supported by U24 RR025736-01, NIGMS: R01-GM083871, NLM: 2R01LM009254, NLM: 2R01LM008111, NLM: 1R01LM010120-01, NHGRI: 5P41HG000330.
Output:
- Store all annotations in a relational database or as serialized CASes (a serialization sketch follows)
- Track provenance
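A sketch of the serialized-CAS option is shown below: each finished CAS is written to an XMI file so its annotations can be reloaded later without re-running the analytics. The output path and file naming are assumptions.

```java
// Sketch: write a finished CAS to an XMI file (one file per document, placeholder path).
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.impl.XmiCasSerializer;

public class CasWriter {
    public static void writeXmi(CAS cas, String docId) throws Exception {
        try (OutputStream out = new FileOutputStream("output/" + docId + ".xmi")) {
            XmiCasSerializer.serialize(cas, out);  // XML serialization of all annotations
        }
    }
}
```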