Roeder posterismb2010


    Scaling Text Mining to One Million Documents
    Christophe Roeder, Karin Verspoor

    [email protected]

    Applying text mining to a large document collection demands more resources than a lab PC can provide. Preparing for such a task requires an understanding of the demands of the text mining software and the capabilities of the supporting hardware and software. We describe efforts to scale a large text mining task.

    Corpus Management:
    • Arrange access from publishers
    • Download files
    • Parse XML of various DTDs to plain text
    • Parse PDF if XML is not available
    • Find or maintain section zoning information
    • Track source and citation information
    • Keep up to date with periodic updates
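The "parse XML of various DTDs to plain text" step above can be sketched with the JDK's built-in DOM parser. This is a minimal illustration, not the authors' actual converter: it assumes no particular DTD and simply concatenates all text nodes under the root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical sketch: reduce article XML to plain text without assuming
// any element names, so it works across publishers' differing DTDs.
public class XmlToText {
    public static String extractText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // getTextContent() on the root concatenates all descendant text nodes;
        // a real converter would also insert whitespace between block elements.
        return doc.getDocumentElement().getTextContent().trim();
    }
}
```

A production version would additionally preserve section zoning (title, abstract, body) rather than flattening everything.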

    Pipeline:
    • Error reporting
    • Identify and restart after memory leaks
    • Identify parameters passed to analytics
    • Progress tracking; restart from the last processed document
    • Identify individual document errors and continue processing others
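The last two pipeline requirements can be sketched in a few lines of plain Java (the names here are illustrative, not the poster's actual implementation): each document is processed in its own try/catch so one bad document cannot kill the run, and the index of the last handled document is recorded so a restart can resume rather than reprocess.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of pipeline robustness: per-document error isolation plus a
// checkpoint index for restarting from the last processed document.
public class RobustPipeline {
    private int lastCompleted = -1;                 // checkpoint for restart
    private final List<String> failures = new ArrayList<>();

    public void run(List<String> docs, Consumer<String> analytic) {
        for (int i = lastCompleted + 1; i < docs.size(); i++) {
            try {
                analytic.accept(docs.get(i));
            } catch (RuntimeException e) {
                // Report this document's error and continue with the others.
                failures.add(docs.get(i) + ": " + e.getMessage());
            }
            lastCompleted = i;                      // progress tracking
        }
    }

    public int lastCompleted() { return lastCompleted; }
    public List<String> failures() { return failures; }
}
```

In a real deployment the checkpoint would be persisted (e.g. to disk or a database) so that a crashed or leaking JVM can be restarted from where it left off.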

    Analytics / Analysis Engines:
    • Identify and integrate into UIMA
    • Check for possible concurrency issues
    • Test for bugs and memory leaks
    • Detailed error reporting
    • Find memory and CPU requirements
    • Track source, build, and modification information

    Scaling:
    • UIMA CPE threads: simple and effective, but limited to one machine
    • UIMA AS: put heavy engines on other machines
    • Grid Engine: move files and run scripts across a cluster
    • Hadoop (map/reduce): elegant Java interface

    Integration:
    • Store semantic information in a knowledge base for further processing
    • Web application to manage and initiate job runs
    • Allow a change to one analytic, with a re-run of only the affected part of the pipeline

    The Devil is in the Details

    Scaling Framework Options

    UIMA CPE
    • Basic UIMA pipeline engine
    • Can run many threads
    • Limited to one machine
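The CPE's threading model, many workers sharing one machine, can be illustrated with a plain thread pool. This is a stand-in sketch using `java.util.concurrent`, not the UIMA CPE API itself; the "analysis engine" here is a trivial placeholder.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of CPE-style scaling: N worker threads on one machine draining a
// shared list of documents. Throughput scales with cores, but only up to
// the limits of that single machine.
public class ThreadedRunner {
    public static int processAll(List<String> docs, int threads) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.submit(() -> {
                doc.length();                  // stand-in for a real analytic
                processed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }
}
```

This is also where the concurrency issues mentioned above bite: every analysis engine run this way must be thread-safe or instantiated once per thread.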

    UIMA AS (Asynchronous Scaleout)
    • Uses message queues to link analytics on different machines
    • Message queues allow flexibility in the timing of message delivery
    • Useful for putting many instances of a heavy analytic on a separate machine
    • Can be used to run many pipelines on many machines
    • XMI serialization overhead is not trivial
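The queue-based decoupling that UIMA AS relies on can be shown in-process with a `BlockingQueue`: the producer (the pipeline) and the consumer (a heavy analytic) run at their own pace. This local sketch stands in for a real broker such as ActiveMQ, which UIMA AS uses to span machines; with a broker, each document would also pay the XMI serialization cost noted above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// In-process sketch of message-queue decoupling: a bounded queue links a
// producer thread to a consumer, so neither blocks the other except when
// the queue is full or empty.
public class QueueSketch {
    public static int drain(String[] docs) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        Thread producer = new Thread(() -> {
            for (String d : docs) {
                try { queue.put(d); } catch (InterruptedException ignored) { return; }
            }
        });
        producer.start();
        int consumed = 0;
        for (int i = 0; i < docs.length; i++) {
            queue.take();                      // the heavy analytic would run here
            consumed++;
        }
        producer.join();
        return consumed;
    }
}
```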

    GridEngine
    • Cluster management software makes it easy to copy files to many machines at once
    • Scripts can be started on many machines with one command

    Hadoop
    • Map/reduce implementation: map distributes, reduce collates
    • Related tools are very interesting: HDFS (the Hadoop file system)
    • Behemoth: UIMA and GATE adapted to Hadoop
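The "map distributes, reduce collates" shape can be shown locally with plain Java streams; this word-count sketch mirrors the two phases without using the Hadoop API, which runs the same structure across a cluster.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Local sketch of the map/reduce model: the map phase emits one record per
// word, the reduce phase collates counts per key. Hadoop distributes both
// phases over many machines; the shape of the computation is the same.
public class MapReduceSketch {
    public static Map<String, Long> wordCount(String... docs) {
        return Arrays.stream(docs)
                .flatMap(doc -> Arrays.stream(doc.split("\\s+")))              // map
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // reduce
    }
}
```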

    Resource Requirements of Selected Analytics

    Name                Time               Memory*
    XSLT Converter      0.03 sec./doc.     < 256 MB
    XML Parser          0.02 sec./doc.     < 256 MB
    Sentence Detector   0.01 sec./doc.     < 256 MB
    POS Tagger          2.6 sec./doc.      < 256 MB
    Parser              1500 sec./doc.     > 1 GB
    XMI Serialization   2.5 sec./doc. **   < 256 MB
    Concept Mapper      ***                > 2 GB

    * Memory usage includes UIMA and other analytics; 64-bit JVM
    ** Annotations from sentence detection, tokenization, and POS tagging; time includes file I/O
    *** Data not available; memory use is from loading Swiss-Prot

    Memory requirements that differ by a factor of ten and run times that span five orders of magnitude suggest that a good description of the pipeline is vital when specifying hardware.
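A back-of-the-envelope calculation from the measurements in the table shows why those numbers drive hardware choices. The helper below is illustrative arithmetic, not part of the poster's software:

```java
// Throughput estimate: per-document time x corpus size, spread over cores.
public class Capacity {
    public static double daysFor(double secPerDoc, long docs, int cores) {
        return secPerDoc * docs / cores / 86_400.0;   // 86,400 seconds per day
    }

    public static void main(String[] args) {
        // POS tagging at 2.6 sec./doc. over one million documents:
        System.out.printf("1 core:    %.0f days%n", daysFor(2.6, 1_000_000, 1));
        System.out.printf("100 cores: %.1f days%n", daysFor(2.6, 1_000_000, 100));
    }
}
```

At 2.6 sec./doc., one core needs roughly a month for a million documents, while the 1500 sec./doc. parser is infeasible without massive parallelism, which is exactly the gap the scaling frameworks above address.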

    Acknowledgements: NIH grant R01-LM010120-01 to Karin Verspoor, and the SciKnowMine project, funded by NSF grant #0849977 and supported by U24 RR025736-01, NIGMS: R01-GM083871, NLM: 2R01LM009254, NLM: 2R01LM008111, NLM: 1R01LM010120-01, NHGRI: 5P41HG000330

    Output:
    • Store all annotations, in an RDB or as serialized CAS
    • Track provenance
