
Page 1: Dictionary Based Annotation at scale with Spark, SolrTextTagger and OpenNLP
Sujit Pal, Elsevier Labs

Page 2: Introduction

• About Me

– Work at Elsevier Labs.

– Interested in Search, NLP and Distributed Processing.

– URL: labs.elsevier.com

– Email: [email protected]

– Twitter: @palsujit

• About Elsevier

– World’s largest publisher of STM Books and Journals.

– Uses Data to inform and enable consumers of STM info.

– And like everybody else, we are hiring!

Page 3: Agenda

• Overview and Background

• Features and API

• Scaling out

• Q&A

Page 4: Overview/Background

Page 5: Problem Definition

• What is the problem?

– Annotate millions of documents from different corpora.

• 14M docs from Science Direct alone.

• More from other corpora, dependency parsing, etc.

– Critical step for Machine Reading and Knowledge Graph applications.

• Why is this such a big deal?

– Takes advantage of existing linked data.

– No model training for multiple complex STM domains.

– However, the approach is simple only until it has to be done at scale.

Page 6: Annotation Pipeline

Page 7: Dictionary Based NE Annotator (SoDA)

• Part of Document Annotation Pipeline.

• Annotates text with Named Entities from external Dictionaries.

• Built with Open Source Components

– Apache Solr – Highly reliable, scalable and fault-tolerant search index.

– SolrTextTagger – Solr component for text tagging, uses Lucene FST technology.

– Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text.

– Apache Spark – Lightning fast, large scale data processing.

• Uses ideas from other Open Source libraries

– FuzzyWuzzy – Fuzzy String Matching like a boss.

• Contributed back to Open Source

– https://github.com/elsevierlabs-os/soda

Page 8: SoDA Architecture

Page 9: How does it work (exact/case matching)?

• Uses the Aho-Corasick algorithm – treats the dictionary as an FST and streams text against it, matching all patterns simultaneously (see the conceptual sketch after this list). Diagram shows the FST for the vocabulary {“his”, “he”, “her”, “hers”, “she”}.

• Michael McCandless implemented FSTs in Lucene (blog post).

• David Smiley built SolrTextTagger to use Lucene FSTs.

• SoDA uses SolrTextTagger for streaming exact and case-insensitive matching.
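The matching idea can be illustrated with a few lines of Python. This is only a conceptual sketch of multi-pattern dictionary matching over the toy vocabulary on this slide; the real system streams text against a Lucene FST via SolrTextTagger, and Aho-Corasick additionally maintains failure links so the text is never re-scanned from each offset.

# Conceptual multi-pattern matcher for the toy vocabulary on this slide
# (not the Lucene FST used in production).

def build_trie(vocab):
    root = {}
    for word in vocab:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = word                      # empty-string key marks the end of an entry
    return root

def tag(text, trie):
    """Return (begin, end, matchedText) for every dictionary hit in text."""
    hits = []
    for start in range(len(text)):
        node = trie
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break
            if "" in node:                   # a dictionary entry ends at this position
                hits.append((start, pos + 1, node[""]))
    return hits

trie = build_trie(["his", "he", "her", "hers", "she"])
print(tag("she sells hers", trie))
# [(0, 3, 'she'), (1, 3, 'he'), (10, 12, 'he'), (10, 13, 'her'), (10, 14, 'hers')]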

Page 10: How does it work (fuzzy matching)?

• Pre-normalizes each dictionary entry into various forms

– Original – “Astrocytoma, Subependymal Giant Cell”

– Lowercased – “astrocytoma, subependymal giant cell”

– Punctuation – “astrocytoma subependymal giant cell”

– Sorted – “astrocytoma cell giant subependymal”

– Stemmed – “astrocytoma cell giant subependym”

• Uses OpenNLP to parse the input text into phrases, normalizes each phrase to the requested normalization level, and matches it against the corresponding field (a normalization sketch follows this list).

• Caller specifies normalization level.
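A hedged sketch of these normalization levels. NLTK's PorterStemmer is used as a stand-in for whatever stemmer the Solr analyzers actually apply, and the function name and level labels are illustrative, not SoDA's API.

import re
from nltk.stem import PorterStemmer          # stand-in stemmer, for illustration only

stemmer = PorterStemmer()

def normalize(entry, level):
    """level is one of: exact, lower, punct, sort, stem (strict to permissive)."""
    if level == "exact":
        return entry
    text = entry.lower()                                   # lowercased
    if level == "lower":
        return text
    text = re.sub(r"[^a-z0-9\s]", " ", text)               # punctuation removed
    text = re.sub(r"\s+", " ", text).strip()
    if level == "punct":
        return text
    tokens = sorted(text.split())                          # tokens sorted
    if level == "sort":
        return " ".join(tokens)
    return " ".join(stemmer.stem(t) for t in tokens)       # each token stemmed

print(normalize("Astrocytoma, Subependymal Giant Cell", "stem"))
# expected (with a Porter-style stemmer): astrocytoma cell giant subependym

The same normalization is applied to each extracted phrase at annotation time, so a phrase is matched against whichever index field corresponds to the requested level.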

Page 11: Features and API

Page 12: Feature Overview

• Provides JSON over HTTP interface

– Compose request as JSON document

– HTTP POST the document to the JSON endpoint URL (HTTP GET for URL-only requests).

– Receive response as JSON document.

• Language-Agnostic and Cross-Platform.

• API can be used from standalone clients, Spark jobs and Databricks notebooks.

• Examples in Scala and Python

Page 13: Services

• Status

– index.json – returns a JSON status document (suitable for health-check monitoring).

• Single Lexicon Services

– annot.json – annotates a block of text in a streaming manner. Supports different levels of matching (strict to permissive).

– matchphrase.json – annotates short phrases. Supports same matching levels as annot.json.

• Multi-Lexicon Services

– dicts.json – lists all lexicons available.

– coverage.json – returns the number of annotations found in the text for each available lexicon (exercised in the sketch after this list).

• Indexing Services

– delete.json – deletes entire lexicon from index.

– add.json – adds an entry to the specified lexicon.
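A minimal sketch of exercising the status and multi-lexicon services with the Python requests library. The base URL and port are assumptions, and the coverage.json payload field name follows the annot.json example on the next slide.

import requests

SODA_URL = "http://localhost:8080/soda"      # assumed host/port

# health check, suitable for a load-balancer or monitoring probe
print(requests.get(SODA_URL + "/index.json").json())

# list all lexicons available in the index
print(requests.get(SODA_URL + "/dicts.json").json())

# count annotations per lexicon for a block of text, across all lexicons
resp = requests.post(
    SODA_URL + "/coverage.json",
    json={"text": "Institute of Clean Coal Technology, East China University"})
print(resp.json())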

Page 14: Annotation Service I/O

Example annotation request

{
  "lexicon": "countries",
  "text": "Institute of Clean Coal Technology, East China University",
  "matching": "exact"
}

Example annotated response

[
  {
    "id": "http://www.geonames.org/CHN",
    "lexicon": "countries",
    "begin": 41,
    "end": 46,
    "coveredText": "China",
    "confidence": 1.0
  }
]

Page 15: Calling Annotation Service

• Client originally written in Python, using the standard json module and the requests library.

• For the Scala client, the SoDA JAR provides classes that mimic the json and requests functionality in Scala.

• Input to both of our (somewhat contrived) examples is a set of (pii: String, affStr: String) tuples, as shown.

• Match against a country lexicon to find country names.

Page 16: Annotation Service – Python Client
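The original slide shows a code screenshot; below is a minimal reconstruction sketch of such a client, assuming the base URL shown and reusing the request/response format from the Annotation Service I/O slide.

import json
import requests

SODA_URL = "http://localhost:8080/soda"      # assumed host/port

def annotate(text, lexicon="countries", matching="exact"):
    payload = {"lexicon": lexicon, "text": text, "matching": matching}
    resp = requests.post(SODA_URL + "/annot.json",
                         data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()       # list of {id, lexicon, begin, end, coveredText, confidence}

for annot in annotate("Institute of Clean Coal Technology, East China University"):
    print(annot["coveredText"], annot["begin"], annot["end"], annot["id"])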

Page 17: Annotation Service – Scala Client

Page 18: Annotation Service – Outputs

• Each annotation result provides:

– Entity ID (not shown)

– Begin position in text

– End position in text

– Matched text

– Confidence (not shown)

• Zero or more Annotations possible per input text.

Page 19: Loading Dictionaries

• Dictionary entries represented by:

– Lexicon Name

– Entry ID (unique across lexicons)

– List of possible synonym terms

• JSON request to add an entry to the MeSH dictionary:

{
  "id": "http://id.nlm.nih.gov/mesh/2015/M0021699",
  "lexicon": "mesh",
  "names": ["Baby Tooth", "Dentitions, Primary", "Milk Tooth", ...],
  "commit": false
}

• Preferable to commit periodically during the load and once more after the batch completes (see the loading sketch below).
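A sketch of batch loading through add.json, following the JSON request above. The base URL, the entries list, and the commit interval are assumptions.

import requests

SODA_URL = "http://localhost:8080/soda"      # assumed host/port

entries = [
    ("http://id.nlm.nih.gov/mesh/2015/M0021699",
     ["Baby Tooth", "Dentitions, Primary", "Milk Tooth"]),
    # ... more (id, [names]) pairs parsed from the MeSH distribution
]

for i, (entry_id, names) in enumerate(entries):
    is_last = (i == len(entries) - 1)
    requests.post(SODA_URL + "/add.json",
                  json={"id": entry_id,
                        "lexicon": "mesh",
                        "names": names,
                        # commit every 1000 entries and once more after the batch
                        "commit": is_last or (i + 1) % 1000 == 0})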

Page 20: Loading Dictionaries – Scala Client

Page 21: Scaling Out

Page 22: SoDA Performance – Expected

• Test: annotate 14M docs in “reasonable time”.

– Approx. 3 s/doc with SoDA+Solr on an EC2 r3.large box (15.5GB RAM, 32GB SSD, 2vCPU).

– Total estimated time: 14M docs × 3 s/doc ≈ 42M seconds ≈ 486 days ≈ 16.2 months!

• Questions

– Can we make the process faster?

– Can we scale out the process?

Page 23: Where is the time being spent?

• Majority of time spent in Solr.

• Some time is spent in SoDA (its share decreases more slowly than Solr’s as transactions get shorter).

• Almost no additional time spent in Spark.

Page 24: Optimization #1: Combine Paragraphs

• Performance measured using 10K random articles.

• Time to annotate 1 article: Mean 2.9s, Median 2.1s.

• Annotation done per paragraph, 40 paragraphs/article on average.

• Reduce HTTP network and parsing overhead by sending the full document in a single request (see the sketch after this list).

• Time to annotate 1 article: Mean 1.4s, Median 0.3s.

• 2x (mean) to 7x (median) improvement.
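A minimal sketch of the change, assuming a list of paragraph strings and the hypothetical annotate() helper from the Python client sketch earlier.

def annotate_article(paragraphs, annotate):
    # Before: one SoDA round trip per paragraph (~40 HTTP calls per article).
    # After: a single round trip for the combined text; begin/end offsets in the
    # response now refer to positions in the joined document.
    full_text = "\n".join(paragraphs)
    return annotate(full_text)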

Page 25: Optimization #2: Tune Solr GC

• Out-of-the-box Solr would GC very frequently, slowing down Spark and causing timeouts.

• Current Index Size: 2.1 GB

• Need to size the box so that approximately 75% of RAM is left to the OS (for the filesystem cache) and the remaining 25% is allocated to Solr (Uwe Schindler's blog post).

• Heap size should be 3-4x index size (Internal Guideline).

• Current Solr Heap Size = 8 GB

• RAM is 30.5 GB

• CMS (Concurrent Mark-Sweep) Garbage Collection.

Page 26: Optimization #3: Larger Spark Cluster

• Running on a cluster with 1 master + 4 workers.

• Each worker has 1 Executor with 4 Cores.

• Number of simultaneous Solr clients = 16 (4 workers * 1 executor * 4 cores) – measured by running lsof -p in a loop on the Solr server.

• Throughput increases with the number of partitions up to about 2x the number of worker cores (see the PySpark sketch after this list).

• Best throughput: 5 docs/sec with 30 partitions on 16 cores.
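A hedged PySpark sketch of this scale-out pattern. The SparkSession setup, the load-balancer URL, and the input/output paths are assumptions; the annotation call reuses the request format shown earlier.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("soda-annotate").getOrCreate()
sc = spark.sparkContext

NUM_WORKER_CORES = 16                        # 4 workers x 1 executor x 4 cores
NUM_PARTITIONS = 2 * NUM_WORKER_CORES        # ~2x cores gave the best throughput

def annotate_partition(rows):
    import requests                          # imported inside so it runs on the executor
    session = requests.Session()
    soda_url = "http://soda-lb:8080/soda/annot.json"   # assumed load-balancer URL
    for pii, aff_str in rows:
        resp = session.post(soda_url, json={"lexicon": "countries",
                                            "text": aff_str,
                                            "matching": "exact"})
        for annot in resp.json():
            yield (pii, annot["id"], annot["coveredText"])

affs = (sc.textFile("s3://bucket/affiliations.tsv")        # assumed input location
          .map(lambda line: tuple(line.split("\t", 1))))   # (pii, affStr)

(affs.repartition(NUM_PARTITIONS)
     .mapPartitions(annotate_partition)
     .saveAsTextFile("s3://bucket/country-annotations"))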

Page 27: Optimization #4: Solr Scaleout

• Upgrade to r3.xlarge (30.5GB RAM, 80GB SSD, 4vCPU)

– Throughput 7.9 docs/s

• Upgrade to 2x r3.2xlarge (61GB RAM, 160GB SSD, 8vCPU) with c3.large LB (3.75GB RAM, 32GB Disk, 2vCPU) running HAProxy.

#-workers   #-requests/server   Throughput (docs/sec)
4           8                   8.62
8           16                  17.334
12          24                  20.64
16          32                  26.845

Page 28: Performance – Did we meet expectations?

• At 26 docs/sec and 14M documents, it will take our current cluster a little over 6 days to annotate against our largest dictionary (8M entries).

• Throughput scales linearly at about 1.5 docs/sec per additional worker, as long as the Solr servers have enough capacity to serve the requests.

• Each Solr box (as configured) can serve sustained loads of up to 30-35 simultaneous requests.

• Number of simultaneous requests approximately equal to number of worker cores.

• Example: annotate 14M documents in 3 days (worked through in the sketch after this list).

– Throughput required: 14M / (3 * 86400) = 54 docs/s

– Number of workers: 54 / 1.5 = 36 workers

– Number of simultaneous requests (4 cores/worker) = 36 * 4 = 144

– Number of Solr servers: 144 / 32 = 4.5, rounded up to 5 servers.
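The same sizing arithmetic as a tiny helper. The per-worker throughput (1.5 docs/s), cores per worker (4), and per-Solr-server capacity (~32 simultaneous requests) are the figures from this slide.

import math

def size_cluster(num_docs, days, docs_per_sec_per_worker=1.5,
                 cores_per_worker=4, requests_per_solr=32):
    throughput = round(num_docs / (days * 86400))           # required docs/sec
    workers = math.ceil(throughput / docs_per_sec_per_worker)
    requests = workers * cores_per_worker                   # simultaneous Solr requests
    solr_servers = math.ceil(requests / requests_per_solr)
    return throughput, workers, requests, solr_servers

print(size_cluster(14_000_000, 3))
# (54, 36, 144, 5) -> 5 Solr servers for a 3-day turnaround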

Page 29: Future Work

• More Lexicons

• Investigate Lexicon-Centric scale out.

– Allows more lexicons.

– Not limited to single index.

• Move to Lucene, eliminate network overhead.

– Asynchronous model

– Use Kafka topic with multiple partitions

– Lucene based tagging consumers

– Write output to S3.

Page 30: Q&A

Page 31: Thank you for listening!

• Questions?

• SoDA available on GitHub

– https://github.com/elsevierlabs-os/soda

• Contact me

[email protected]