Michelle Casbon January 12, 2016 – Advanced Apache Spark Meetup Training & Serving NLP Models in a Distributed Cloud-based Infrastructure

Advanced Spark Meetup - Jan 12, 2016


Page 1: Advanced Spark Meetup - Jan 12, 2016

Michelle Casbon
January 12, 2016 – Advanced Apache Spark Meetup

Training & Serving NLP Models in a Distributed Cloud-based Infrastructure

Page 2: Advanced Spark Meetup - Jan 12, 2016

What do we do?

• Idibon creates adaptive machine intelligence that can analyze text in any language

[Diagram labels: natural language text; social media; structured insights]

Page 3: Advanced Spark Meetup - Jan 12, 2016

Agenda

• Background
• Platform description
• Why we chose Spark
• How we’re using Spark ML & MLlib
• Challenges of adopting Spark in a distributed NLP system

Page 4: Advanced Spark Meetup - Jan 12, 2016


What are our use cases?

Intent to purchase

Global health trends

Interactive Voice Response

Multilingual news

SMS Prioritization

Supply Chain Risk

Change reception

Page 5: Advanced Spark Meetup - Jan 12, 2016

How do we do it?

Page 6: Advanced Spark Meetup - Jan 12, 2016

Adaptive learning

• Fewer annotations
• Lower costs
• Less time spent training
• Higher accuracy
• Improves over time

[Diagram labels: unlabeled pool; intelligent queuing; human annotation; labeled training set; machine learning]
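The "intelligent queuing" step of the adaptive learning loop can be illustrated with uncertainty sampling, where the items the current model is least sure about go to human annotators first. The slides do not say which queuing strategy Idibon uses, so the strategy, the function name `queue_for_annotation`, and the scores below are all assumptions made for illustration.

```python
from typing import Dict, List


def queue_for_annotation(scores: Dict[str, float], batch_size: int) -> List[str]:
    """Return the ids of the unlabeled items the model is least sure about.

    `scores` maps an item id to the model's predicted probability of the
    positive class. Items closest to 0.5 are queued first (uncertainty
    sampling, an assumed strategy that the slides do not confirm).
    """
    ranked = sorted(scores, key=lambda item_id: abs(scores[item_id] - 0.5))
    return ranked[:batch_size]


# Hypothetical model scores for an unlabeled pool of four documents.
unlabeled_pool = {"doc-a": 0.97, "doc-b": 0.52, "doc-c": 0.10, "doc-d": 0.41}
next_batch = queue_for_annotation(unlabeled_pool, batch_size=2)
```

Once the queued items are annotated they would join the labeled training set, the model would be retrained, and the remaining pool rescored, which is how such a loop can reach a given accuracy with fewer annotations.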

Page 7: Advanced Spark Meetup - Jan 12, 2016

How do we do it?

1. Goal Definition

Dataset
2. Identification
3. Cleansing
4. Training data creation
5. Quality Control

Models
6. Creation
7. Hyperparameter Tuning
8. Intelligent Queueing
9. Rule Creation

Unseen Data
10. Prediction

Page 8: Advanced Spark Meetup - Jan 12, 2016

What does our platform look like?

Page 9: Advanced Spark Meetup - Jan 12, 2016

Why are we using Spark?

• Wide variety of algorithms
• Active development
• Open source
• Industry-standard algorithm implementation
• Intended for use in enterprise applications
• Scalability

Page 10: Advanced Spark Meetup - Jan 12, 2016

How are we using Spark?

• Feature Extraction
    • TF-IDF
    • Word2Vec
    • Dimensionality reduction
• Training
    • Logistic Regression
    • SVM
    • Naïve Bayes
    • LDA
• Prediction
    • Evaluation metrics

[Diagram: Feature Extraction → Training → Prediction; example data [1.0, [1.0, 0.0, 3.0]]]
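As a rough illustration of the TF-IDF item on this slide, the sketch below weights each term by its count in a document times a smoothed inverse document frequency, log((m + 1) / (df + 1)), which is the form the MLlib documentation describes. It is plain Python rather than Spark code, and the corpus and the function name `tf_idf` are made up for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by its count times a smoothed inverse document frequency."""
    doc_count = len(documents)
    # Number of documents each term appears in at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {t: math.log((doc_count + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in documents
    ]


# Three hypothetical, already tokenized documents.
corpus = [["buy", "phone"], ["phone", "broken"], ["phone", "refund", "refund"]]
weights = tf_idf(corpus)
```

A term that occurs in every document, such as "phone" here, gets a weight of zero, while a term concentrated in one document, such as "refund", is weighted up.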

Page 11: Advanced Spark Meetup - Jan 12, 2016

Feature Extraction

Extract Content → Tokenize → Bigrams, Trigrams → Feature Lookup → Vector [1.0, 0.0, 3.0]
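The feature extraction steps on this slide can be sketched in plain Python as follows. This is an illustration under assumptions, not Idibon's implementation: the tokenizer is a simple regular expression, single tokens are counted alongside bigrams and trigrams, and the `feature_index` contents are hypothetical.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-grams, joined with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_vector(text: str, feature_index: Dict[str, int]) -> List[float]:
    """Count the token, bigram, and trigram features known to the index."""
    tokens = tokenize(text)
    vector = [0.0] * len(feature_index)
    for feature in tokens + ngrams(tokens, 2) + ngrams(tokens, 3):
        position = feature_index.get(feature)  # feature lookup
        if position is not None:
            vector[position] += 1.0
    return vector


# Hypothetical feature index; a real one would be built from training data.
feature_index = {"want to buy": 0, "refund": 1, "phone": 2}
vector = to_vector("I want to buy a phone, phone, phone!", feature_index)
```

With this made-up index the example sentence maps to [1.0, 0.0, 3.0], the same shape of vector the slide shows: one trigram match, no "refund", and three occurrences of "phone".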

Page 12: Advanced Spark Meetup - Jan 12, 2016

Training

Vector [1.0, 0.0, 3.0] → Add classification → LabeledPoint [1.0, [1.0, 0.0, 3.0]]

LabeledPoint → LogisticRegressionWithLBFGS → LogisticRegressionModel → Model Storage
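The training flow on this slide pairs a feature vector with its classification and fits a logistic regression model, which is then stored. The sketch below mirrors that flow in plain Python so it runs without a cluster. It is not the Spark API: `LabeledPoint` here is a local stand-in for the MLlib class of the same name, plain batch gradient descent is swapped in for L-BFGS, and JSON serialization stands in for the model storage step, whose actual form the slides do not describe.

```python
import json
import math
from typing import List, NamedTuple


class LabeledPoint(NamedTuple):
    """Plain-Python stand-in for the label/features pair on the slide."""
    label: float
    features: List[float]


def train_logistic_regression(points, steps=500, learning_rate=0.1):
    """Fit weights and an intercept with batch gradient descent (not L-BFGS)."""
    n = len(points)
    weights = [0.0] * len(points[0].features)
    intercept = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for point in points:
            margin = intercept + sum(w * x for w, x in zip(weights, point.features))
            # Gradient of the log loss: predicted probability minus label.
            error = 1.0 / (1.0 + math.exp(-margin)) - point.label
            grad_b += error
            for i, x in enumerate(point.features):
                grad_w[i] += error * x
        intercept -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return {"weights": weights, "intercept": intercept}


# Two hypothetical training examples: a vector plus its classification.
training_set = [
    LabeledPoint(1.0, [1.0, 0.0, 3.0]),
    LabeledPoint(0.0, [0.0, 2.0, 0.0]),
]
model = train_logistic_regression(training_set)
stored_model = json.dumps(model)  # stand-in for the "Model Storage" step
```

The first training example reproduces the slide's [1.0, [1.0, 0.0, 3.0]] pair: label 1.0 followed by the feature vector.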

Page 13: Advanced Spark Meetup - Jan 12, 2016

Prediction

New tweet → Extract Content → Tokenize → Bigrams, Trigrams → Feature Lookup → Vector [0.0, 1.0, 4.0]

Vector [0.0, 1.0, 4.0] → Model Lookup → Predict → Classification Lookup
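The second half of the prediction flow (model lookup, predict, classification lookup) might look like the following in plain Python. Everything named here is hypothetical: the `MODEL_STORE` and `CLASSIFICATIONS` dictionaries, the model id, and the weights are invented for the example, and a real system would load a trained LogisticRegressionModel rather than hard-coded numbers.

```python
import math
from typing import Dict, List

# Hypothetical stored models, keyed by a made-up model id.
MODEL_STORE: Dict[str, Dict] = {
    "purchase-intent": {"weights": [1.5, -2.0, 1.0], "intercept": -1.0},
}
# Hypothetical mapping from numeric class to a readable classification.
CLASSIFICATIONS: Dict[float, str] = {0.0: "no intent", 1.0: "intent to purchase"}


def predict(model_id: str, vector: List[float], threshold: float = 0.5) -> str:
    """Look up a model, score one vector, and return its classification name."""
    model = MODEL_STORE[model_id]  # model lookup
    margin = model["intercept"] + sum(
        w * x for w, x in zip(model["weights"], vector)
    )
    probability = 1.0 / (1.0 + math.exp(-margin))  # logistic function
    label = 1.0 if probability > threshold else 0.0
    return CLASSIFICATIONS[label]  # classification lookup


# The vector from the slide, as produced by feature extraction on a new tweet.
result = predict("purchase-intent", [0.0, 1.0, 4.0])
```

Scoring a single vector this way involves only a dot product and one exponential, which is why predicting directly on vectors is cheap.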

Page 14: Advanced Spark Meetup - Jan 12, 2016

How do we provide online predictions with Spark?

DataFrames are slow ...
… if you have small data

Task                    Time in µs
Vector prediction       300
DataFrame prediction    7800
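To put the two timings in perspective, the short calculation below derives the relative slowdown and the implied sequential throughput. It assumes the reported figures are per single prediction and that predictions run one after another on one thread, neither of which the slide states explicitly.

```python
MICROSECONDS_PER_SECOND = 1_000_000

# Latencies reported on the slide, in microseconds.
vector_prediction_us = 300
dataframe_prediction_us = 7800

slowdown = dataframe_prediction_us / vector_prediction_us
vector_per_second = MICROSECONDS_PER_SECOND / vector_prediction_us
dataframe_per_second = MICROSECONDS_PER_SECOND / dataframe_prediction_us

print(f"DataFrame path is {slowdown:.0f}x slower per prediction")
print(f"Vector path:    about {vector_per_second:.0f} predictions/s")
print(f"DataFrame path: about {dataframe_per_second:.0f} predictions/s")
```

Under those assumptions the DataFrame path is 26 times slower, which works out to roughly 128 predictions per second against roughly 3,333 for direct vector prediction.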

Page 15: Advanced Spark Meetup - Jan 12, 2016


How do we fit Spark into our existing system?

Core functionality

Idibon custom ML

REST API

ML persistence layer

Page 16: Advanced Spark Meetup - Jan 12, 2016

How does a persistence layer enable us to use Spark?

• Real-time operationalization of many, many models
• Embed within different platforms
• Single save/load framework
• Rapidly incorporate new NLP features
• Logging/monitoring standardized & abstracted
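One way to picture a "single save/load framework" is a common envelope that tags each stored model with its kind, so that Spark-trained models and custom models can be written and restored through the same two functions. The slides give no details of Idibon's persistence layer, so the classes, the `kind` tag, and the JSON envelope below are purely hypothetical.

```python
import json
from typing import Callable, Dict, List


class LinearModel:
    """Hypothetical model type: a weight vector plus an intercept."""
    kind = "linear"

    def __init__(self, weights: List[float], intercept: float) -> None:
        self.weights = weights
        self.intercept = intercept

    def to_dict(self) -> Dict:
        return {"weights": self.weights, "intercept": self.intercept}

    @classmethod
    def from_dict(cls, data: Dict) -> "LinearModel":
        return cls(data["weights"], data["intercept"])


class KeywordRule:
    """Hypothetical non-ML model type: a list of trigger keywords."""
    kind = "keyword-rule"

    def __init__(self, keywords: List[str]) -> None:
        self.keywords = keywords

    def to_dict(self) -> Dict:
        return {"keywords": self.keywords}

    @classmethod
    def from_dict(cls, data: Dict) -> "KeywordRule":
        return cls(data["keywords"])


# Registry mapping the stored "kind" tag back to a loader.
LOADERS: Dict[str, Callable[[Dict], object]] = {
    LinearModel.kind: LinearModel.from_dict,
    KeywordRule.kind: KeywordRule.from_dict,
}


def save(model) -> str:
    """Serialize any registered model to one JSON envelope."""
    return json.dumps({"kind": model.kind, "payload": model.to_dict()})


def load(blob: str):
    """Rebuild a model from its envelope, dispatching on the kind tag."""
    envelope = json.loads(blob)
    return LOADERS[envelope["kind"]](envelope["payload"])


restored = load(save(LinearModel([1.0, 0.0, 3.0], 0.5)))
```

In a design like this, supporting a new model type means registering one more loader, while callers such as a REST API keep using the same `save` and `load` calls.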

Page 17: Advanced Spark Meetup - Jan 12, 2016

Summary

• Analyzing human language is hard
• We’re using the most exciting parts of Spark to build performant NLP systems that are faster & better than ever before

Page 18: Advanced Spark Meetup - Jan 12, 2016

Questions?

Michelle Casbon
[email protected]
@texasmichelle