Michelle Casbon
January 12, 2016 – Advanced Apache Spark Meetup
Training & Serving NLP Models in a Distributed Cloud-based Infrastructure
What do we do?
• Idibon creates adaptive machine intelligence that can analyze text in any language
[Diagram labels: natural language text, social media, structured insights]
Agenda
• Background
• Platform description
• Why we chose Spark
• How we’re using Spark ML & MLlib
• Challenges of adopting Spark in a distributed NLP system
What are our use cases?
Intent to purchase
Global health trends
Interactive Voice Response
Multilingual news
SMS Prioritization
Supply Chain Risk
Change reception
How do we do it?
• Fewer annotations
• Lower costs
• Less time spent training
• Higher accuracy
• Improves over time
[Diagram, Adaptive learning: unlabeled pool, intelligent queuing & human annotation, labeled training set, machine learning]
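The slides do not say how items are chosen for annotation. As a rough sketch only, the snippet below assumes uncertainty sampling, a common active-learning strategy, as a stand-in for "intelligent queuing": the items the current model is least sure about go to human annotators first. The function name, the document IDs, and the scores are all hypothetical.

```python
from typing import Callable, List


def queue_for_annotation(
    pool: List[str],
    predict_proba: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Return the batch_size items whose predicted probability is closest to 0.5.

    Assumption: a binary classifier that returns P(positive) for each item.
    """
    return sorted(pool, key=lambda item: abs(predict_proba(item) - 0.5))[:batch_size]


# Hypothetical model scores for four unlabeled documents.
scores = {"doc-a": 0.97, "doc-b": 0.52, "doc-c": 0.08, "doc-d": 0.41}

# The two least certain documents are queued for human annotation.
batch = queue_for_annotation(list(scores), scores.get, 2)
print(batch)
```

Once annotated, the queued items would join the labeled training set and the model would be retrained, which is the loop the diagram labels describe.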
How do we do it?
1. Goal Definition
Dataset
2. Identification
3. Cleansing
4. Training data creation
5. Quality Control
Models
6. Creation
7. Hyperparameter Tuning
8. Intelligent Queueing
9. Rule Creation
Unseen Data
10. Prediction
What does our platform look like?
Why are we using Spark?
• Wide variety of algorithms
• Active development
• Open source
• Industry-standard algorithm implementation
• Intended for use in enterprise applications
• Scalability
How are we using Spark?
• Feature Extraction
  • TF-IDF
  • Word2Vec
  • Dimensionality reduction
• Training
  • Logistic Regression
  • SVM
  • Naïve Bayes
  • LDA
• Prediction
  • Evaluation metrics
[Diagram: Feature Extraction → Training → Prediction, with example labeled point [1.0, [1.0, 0.0, 3.0]]]
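To make the TF-IDF item under feature extraction concrete, here is a minimal plain-Python sketch. It is not Spark code: it uses raw term counts and the smoothed inverse document frequency log((m + 1) / (df + 1)) that MLlib documents for its IDF, while MLlib's own term-frequency step hashes terms into a fixed-size vector. The toy documents are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by its count in the document times a smoothed IDF."""
    m = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]


# Made-up, pre-tokenized documents.
docs = [["spark", "is", "fast"], ["spark", "ml"], ["nlp", "is", "hard"]]
weights = tf_idf(docs)

# "fast" appears in one document, "spark" in two, so "fast" is weighted higher.
print(weights[0])
```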
Feature Extraction
Extract Content → Tokenize → Bigrams, Trigrams → Feature Lookup → Vector [1.0, 0.0, 3.0]
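The extraction steps on this slide can be sketched in plain Python. This is an illustration under assumptions, not Idibon's implementation: the tokenizer regex, the feature table, and the sample text are hypothetical, chosen so the output matches the slide's example vector.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split on anything that is not a letter, digit, or apostrophe."""
    return [tok for tok in re.split(r"[^a-z0-9']+", text.lower()) if tok]


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_vector(text: str, feature_index: Dict[str, int]) -> List[float]:
    """Count occurrences of each known bigram/trigram at its assigned vector position."""
    tokens = tokenize(text)
    vector = [0.0] * len(feature_index)
    for feature in ngrams(tokens, 2) + ngrams(tokens, 3):
        position = feature_index.get(feature)  # feature lookup; unknown n-grams are skipped
        if position is not None:
            vector[position] += 1.0
    return vector


# Hypothetical feature table mapping n-grams to vector positions.
feature_index = {"want to": 0, "free shipping": 1, "so good": 2}

vec = to_vector("Want to try it: so good, so good, so good!", feature_index)
print(vec)  # [1.0, 0.0, 3.0]
```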
Training
Vector [1.0, 0.0, 3.0] → Add classification → LabeledPoint [1.0, [1.0, 0.0, 3.0]] → LogisticRegressionWithLBFGS → LogisticRegressionModel → Model Storage
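To show what the training step consumes and produces, here is a plain-Python stand-in. It is deliberately not Spark's API: it fits a logistic regression with batch gradient descent instead of L-BFGS, and the training points (apart from the slide's [1.0, [1.0, 0.0, 3.0]]), the step count, and the learning rate are assumptions for illustration.

```python
import math
from typing import List, Tuple

# (label, features), mirroring the slide's LabeledPoint [1.0, [1.0, 0.0, 3.0]].
LabeledPoint = Tuple[float, List[float]]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(points: List[LabeledPoint], steps: int = 500, lr: float = 0.1) -> Tuple[List[float], float]:
    """Fit weights and an intercept by batch gradient descent on the log loss."""
    dim = len(points[0][1])
    weights, intercept = [0.0] * dim, 0.0
    n = len(points)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for label, x in points:
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + intercept) - label
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        intercept -= lr * grad_b / n
    return weights, intercept


# Hypothetical training set; the first point is the one shown on the slide.
training_set: List[LabeledPoint] = [
    (1.0, [1.0, 0.0, 3.0]),
    (1.0, [0.0, 0.0, 2.0]),
    (0.0, [0.0, 2.0, 0.0]),
    (0.0, [1.0, 3.0, 0.0]),
]
weights, intercept = train(training_set)
print(weights, intercept)
```

The fitted weights and intercept are what a model store would need to keep for later prediction.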
Prediction
New tweet → Extract Content → Tokenize → Bigrams, Trigrams → Feature Lookup → Vector [0.0, 1.0, 4.0] → Model Lookup → Predict → Classification Lookup
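The last three steps (model lookup, predict, classification lookup) might look roughly like the sketch below. The model store, its key, the weights, and the class names are all hypothetical. The 0.5 decision threshold matches the default of MLlib's binary LogisticRegressionModel, but this is plain Python, not Spark code.

```python
import math
from typing import Dict, List

# Hypothetical stored models, keyed by task name.
model_store: Dict[str, dict] = {
    "intent-to-purchase": {"weights": [0.2, -1.5, 0.9], "intercept": -0.4},
}

# Hypothetical mapping from numeric labels back to human-readable classifications.
class_names = {1.0: "purchase intent", 0.0: "no purchase intent"}


def predict(model: dict, features: List[float], threshold: float = 0.5) -> float:
    """Return 1.0 if the logistic score exceeds the threshold, else 0.0."""
    margin = sum(w * x for w, x in zip(model["weights"], features)) + model["intercept"]
    score = 1.0 / (1.0 + math.exp(-margin))
    return 1.0 if score > threshold else 0.0


features = [0.0, 1.0, 4.0]                 # vector from the slide
model = model_store["intent-to-purchase"]  # model lookup
label = predict(model, features)           # predict
print(class_names[label])                  # classification lookup
```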
How do we provide online predictions with Spark?
Task                   Time (µs)
Vector prediction      300
DataFrame prediction   7800
DataFrames are slow ... if you have small data
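A quick back-of-the-envelope reading of the two timings, assuming predictions are made one at a time on a single thread with no other overhead (an assumption, not something the slide states):

```python
# Timings taken from the table above, in microseconds per prediction.
vector_us = 300.0
dataframe_us = 7800.0

slowdown = dataframe_us / vector_us            # how many times slower the DataFrame path is
vector_per_sec = 1_000_000 / vector_us         # sequential predictions per second
dataframe_per_sec = 1_000_000 / dataframe_us

print(f"DataFrame path is {slowdown:.0f}x slower")
print(f"about {vector_per_sec:.0f} vs {dataframe_per_sec:.0f} predictions per second")
```

Under that assumption the Vector path is 26 times faster per prediction, which matters most when each request carries only one small document.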
How do we fit Spark into our existing system?
Core functionality
Idibon custom ML
…
REST API
ML persistence layer
How does a persistence layer enable us to use Spark?
• Real-time operationalization of many, many models
• Embed within different platforms
• Single save/load framework
• Rapidly incorporate new NLP features
• Logging/monitoring standardized & abstracted
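One way to picture a "single save/load framework" is a registry of codecs, so that every model type, whatever library produced it, is stored and restored through the same two calls. The sketch below is purely illustrative: the function names, the JSON envelope, and the two model kinds are assumptions, not Idibon's design.

```python
import json
from typing import Any, Callable, Dict, Tuple

# kind -> (serialize to dict, deserialize from dict)
_CODECS: Dict[str, Tuple[Callable[[Any], dict], Callable[[dict], Any]]] = {}


def register(kind: str, to_dict: Callable[[Any], dict], from_dict: Callable[[dict], Any]) -> None:
    """Teach the persistence layer how to store one kind of model."""
    _CODECS[kind] = (to_dict, from_dict)


def save(kind: str, model: Any) -> str:
    """Serialize any registered model into a common JSON envelope."""
    to_dict, _ = _CODECS[kind]
    return json.dumps({"kind": kind, "payload": to_dict(model)})


def load(blob: str) -> Any:
    """Restore a model; the envelope says which codec to use."""
    envelope = json.loads(blob)
    _, from_dict = _CODECS[envelope["kind"]]
    return from_dict(envelope["payload"])


class LinearModel:
    """Hypothetical stand-in for a trained linear classifier."""

    def __init__(self, weights, intercept):
        self.weights = weights
        self.intercept = intercept


register(
    "linear",
    lambda m: {"weights": m.weights, "intercept": m.intercept},
    lambda d: LinearModel(d["weights"], d["intercept"]),
)
# A second, unrelated kind (a hypothetical rule list) goes through the same interface.
register("rules", lambda rules: {"rules": rules}, lambda d: d["rules"])

blob = save("linear", LinearModel([0.2, -1.5, 0.9], -0.4))
restored = load(blob)
print(restored.weights, restored.intercept)
```

Callers never need to know which backend trained a model, which is what lets new model types be added behind one interface.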
Summary
• Analyzing human language is hard
• We’re using the most exciting parts of Spark to build performant NLP systems that are faster & better than ever before