SplunkLive! London 2017 - Using Machine Learning to Feed Hungry People

© 2017 SPLUNK INC.

Feeding Hungry People With Machine Learning

Duncan Turnbull | Technical Lead, Business Analytics & IoT, EMEA

11TH MAY 2017 | LONDON


Run the Business in Real-time

Data From the Past | Real-time Data | Statistical Forecast

T – a few days | T + a few days

Security Operations Center

IT Operations Center

Business Operations Center

Predictive (Models)

Descriptive (BI Tools, Data Lakes) | Grey space


Overview of ML at Splunk

Core Platform Search

Packaged Premium Solutions

Custom ML

Platform for Operational Intelligence


Machine Learning is Not Magic

▶ … it’s a process.

▶ Data Preparation is about 80%*


Collect Data
Explore/Visualize
Model
Evaluate
Clean/Transform
Publish/Deploy

* “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”, Forbes Mar 23, 2016


Splunk is a playground for Data Preparation


Collect Data
Explore/Visualize
Model
Evaluate
Clean/Transform
Publish/Deploy

props.conf, transforms.conf, Datamodels
Add-ons from Splunkbase
Pivot, Table UI, SPL
ML Toolkit
Alerts, Dashboards, Reports


ML-SPL Commands: A “grammar” for ML

Fit (i.e. train) a model from search results

… | fit <ALGORITHM> <TARGET> from <VARIABLES …> <PARAMETERS> into <MODEL>

Apply a model to obtain predictions from (new) search results

… | apply <MODEL>

Inspect the model inferred by <ALGORITHM> (e.g. display coefficients)

| summary <MODEL>
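The three commands above can be pictured with a small Python analogy. This is only a sketch and not how ML-SPL is implemented: the `MODELS` dictionary, the one-variable least-squares fit, and the `hour` and `temp` fields are all assumptions made up for illustration.

```python
# Hypothetical Python analogy of the fit / apply / summary grammar.
# MODELS stands in for wherever saved models live; it is not a Splunk API.
MODELS = {}

def fit(rows, target, variable, into):
    """Train target = slope * variable + intercept by least squares, then save it."""
    xs = [float(r[variable]) for r in rows]
    ys = [float(r[target]) for r in rows]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    MODELS[into] = {"target": target, "variable": variable,
                    "slope": slope, "intercept": mean_y - slope * mean_x}
    # Like the fit command, also return predictions for the training rows.
    return apply(rows, into)

def apply(rows, model_name):
    """Add a predicted(<target>) field to each row using a saved model."""
    m = MODELS[model_name]
    out = []
    for r in rows:
        pred = m["slope"] * float(r[m["variable"]]) + m["intercept"]
        out.append({**r, f"predicted({m['target']})": pred})
    return out

def summary(model_name):
    """Expose what the model learned (here: the coefficients)."""
    m = MODELS[model_name]
    return [{"feature": m["variable"], "coefficient": m["slope"]},
            {"feature": "_intercept", "coefficient": m["intercept"]}]

training = [{"hour": 1, "temp": 12.0}, {"hour": 2, "temp": 14.0}, {"hour": 3, "temp": 16.0}]
fit(training, target="temp", variable="hour", into="temp_model")
new_rows = apply([{"hour": 4}], "temp_model")  # rows the model has not seen
print(new_rows, summary("temp_model"))
```

The point of the analogy is the separation of concerns: training writes a named model once, and any later search can reuse it by name.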


fit: How It Works


1. Discard fields that are null for all search results.

2. Discard non-numeric fields with >100 distinct values.

3. Discard search results with any null fields.

4. Convert non-numeric fields to binary indicator variables (i.e. “dummy coding”).

5. Convert to a numeric matrix and hand over to <ALGORITHM>.

6. Compute predictions for all search results.

7. Save the learned model.
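Steps 1 to 4 above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the toolkit's code: `prepare` is a hypothetical helper, `None` stands for a null field, and the demo lowers the distinct-value threshold to 3 (instead of 100) so that the five sample rows used in the walkthrough on the following slides behave as the slides show.

```python
def prepare(rows, target, max_distinct=100):
    """Sketch of steps 1-4: returns (column names, numeric matrix X, targets y)."""
    fields = [f for f in rows[0] if f != target]

    # 1. Discard fields that are null for all rows.
    fields = [f for f in fields if any(r[f] is not None for r in rows)]

    # 2. Discard non-numeric fields with more than max_distinct distinct values.
    numeric = {f for f in fields
               if all(isinstance(r[f], (int, float)) for r in rows if r[f] is not None)}
    fields = [f for f in fields
              if f in numeric
              or len({r[f] for r in rows if r[f] is not None}) <= max_distinct]

    # 3. Discard rows with a null in the target or any remaining field.
    kept = [r for r in rows if all(r[f] is not None for f in [target] + fields)]

    # 4. Dummy-code non-numeric fields: one 0/1 column per distinct value,
    #    in order of first appearance.
    columns = []
    for f in fields:
        if f in numeric:
            columns.append((f, None))
        else:
            columns.extend((f, v) for v in dict.fromkeys(r[f] for r in kept))

    names = [f if v is None else f"{f}={v}" for f, v in columns]
    X = [[r[f] if v is None else int(r[f] == v) for f, v in columns] for r in kept]
    y = [r[target] for r in kept]
    return names, X, y

rows = [
    {"field_A": "ok",    "field_B": 41, "field_C": None, "field_D": "red",   "field_E": "172.24.16.5"},
    {"field_A": "ok",    "field_B": 32, "field_C": None, "field_D": "green", "field_E": "192.168.0.2"},
    {"field_A": "FRAUD", "field_B": 1,  "field_C": None, "field_D": "blue",  "field_E": "10.6.6.6"},
    {"field_A": "ok",    "field_B": 43, "field_C": None, "field_D": None,    "field_E": "171.64.72.1"},
    {"field_A": None,    "field_B": 2,  "field_C": None, "field_D": "blue",  "field_E": "192.168.0.2"},
]
# max_distinct=3 is a demo-only assumption so that field_E is dropped on toy data.
names, X, y = prepare(rows, target="field_A", max_distinct=3)
print(names)
print(X)
print(y)
```

With these inputs the sketch keeps three rows and four columns, which matches the matrix shown later in the walkthrough.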

fit: How It Works


Target     Explanatory Variables…
field_A    field_B    field_C    field_D    field_E
ok         41         (null)     red        172.24.16.5
ok         32         (null)     green      192.168.0.2
FRAUD      1          (null)     blue       10.6.6.6
ok         43         (null)     (null)     171.64.72.1
(null)     2          (null)     blue       192.168.0.2

… | fit LogisticRegression field_A from field_*

fit: How It Works


field_A    field_B    field_D    field_E
ok         41         red        172.24.16.5
ok         32         green      192.168.0.2
FRAUD      1          blue       10.6.6.6
ok         43         (null)     171.64.72.1
(null)     2          blue       192.168.0.2



fit: How It Works

3. Discard search results with any null fields.

field_A    field_B    field_D
ok         41         red
ok         32         green
FRAUD      1          blue
ok         43         (null)
(null)     2          blue



fit: How It Works


ok 41 red

ok 32 green

FRAUD 1 blue



4. Convert non-numeric fields to binary indicator variables.

field_A    field_B    field_D=red    …=green    …=blue
ok         41         1              0          0
ok         32         0              1          0
FRAUD      1          0              0          1

fit: How It Works


y = [1, 1, 0]

X = [[41, 1, 0, 0],
     [32, 0, 1, 0],
     [1, 0, 0, 1]]

e.g. for Logistic Regression:

$$y = \frac{1}{1 + e^{-(\theta^T x)}}$$

Find θ using maximum likelihood estimation.

Model inference generally delegated to scikit-learn and statsmodels (e.g. sklearn.linear_model.LogisticRegression).
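To make the logistic formula concrete, the sketch below evaluates it on the X and y from this slide using only the Python standard library. The θ values are hypothetical numbers picked for illustration, not coefficients fitted by scikit-learn or by the toolkit; the log-likelihood function shows the quantity that maximum likelihood estimation tries to maximise.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """Probability of class 1 for one row x, given coefficients theta."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def log_likelihood(theta, X, y):
    """Log-likelihood of labels y under theta; MLE searches for the theta that maximises it."""
    total = 0.0
    for x, label in zip(X, y):
        p = predict_proba(theta, x)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total

X = [[41, 1, 0, 0],
     [32, 0, 1, 0],
     [1, 0, 0, 1]]
y = [1, 1, 0]  # 1 = "ok", 0 = "FRAUD"

theta = [0.1, 0.0, 0.0, -2.0]  # hypothetical coefficients, not a fitted model
predictions = [1 if predict_proba(theta, x) >= 0.5 else 0 for x in X]
print(predictions)
print(log_likelihood(theta, X, y), log_likelihood([0.0] * 4, X, y))
```

A θ that explains the labels better yields a higher log-likelihood than the all-zero starting point, which is the comparison the last line prints.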

fit: How It Works


field_A    field_B    field_C    field_D    field_E        predicted(field_A)
ok         41         (null)     red        172.24.16.5    ok
ok         32         (null)     green      192.168.0.2    ok
FRAUD      1          (null)     blue       10.6.6.6       FRAUD
ok         43         (null)     (null)     171.64.72.1    ok
(null)     2          (null)     blue       192.168.0.2    FRAUD



Prediction

fit: How It Works

7. Save the learned model.

Serialize model settings, coefficients, etc. into a Splunk lookup table.

• Replicated amongst members of Search Head Cluster.

• Automatically distributed to Indexers with search bundle.

… | fit LogisticRegression field_A from field_* into logreg_model

fit: Scalability

▶ Some algorithms support incremental fitting, e.g.: SGDRegressor, SGDClassifier, NaiveBayes

• Use “partial_fit=t” option with fit command.

• No sampling, no event limit!

▶ Some algorithms are inherently not scalable.

• e.g. kernel-based Support Vector Machines are $O(N^3)$

▶ Input is down-sampled using reservoir sampling.

• Per-algorithm sample reservoir size, typically 100,000 events

• Configurable in mlspl.conf.

▶ For the most part, you don’t need to care.
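Reservoir sampling, mentioned above, keeps a fixed-size uniform random sample of a stream without knowing its length in advance. The slide does not say which variant the toolkit uses; the sketch below assumes the classic "Algorithm R", and the function name and parameters are hypothetical.

```python
import random

def reservoir_sample(events, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable (Algorithm R)."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, event in enumerate(events):
        if i < k:
            # Fill the reservoir with the first k events.
            reservoir.append(event)
        else:
            # Keep the new event with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = event
    return reservoir

sample = reservoir_sample(range(1000), 10, random.Random(0))
small = reservoir_sample(range(5), 10)  # fewer events than k: everything is kept
print(len(sample), small)
```

Memory use depends only on the reservoir size, not on the number of events, which is why a fixed cap such as 100,000 events keeps training cost bounded.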

ML-SPL Commands: apply

… | apply <MODEL>

Examples:

… | apply temp_model

… | apply user_behavior_clusters

… | apply petal_length_from_species

apply: How It Works

1. Load the learned model.



4. Convert non-numeric fields to binary indicator variables (i.e. “dummy coding”).

5. Discard variables not in the learned model.

6. Fill missing fields with 0’s.
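Steps 4 to 6 above can be sketched as follows. This is an illustration under assumptions rather than the toolkit's code: `encode_for_apply` is a hypothetical helper, `None` stands for a null field, and the column list is assumed to have been remembered from fit time. The sample rows are the ones used in the walkthrough on the following slides, including the unseen value "yellow".

```python
def encode_for_apply(rows, model_columns, numeric_fields):
    """Encode new rows so they line up with the columns a model was trained on."""
    X = []
    for r in rows:
        encoded = {}
        for field, value in r.items():
            if value is None:
                continue
            if field in numeric_fields:
                encoded[field] = value
            else:
                # Step 4: dummy-code non-numeric values as "<field>=<value>" = 1.
                encoded[f"{field}={value}"] = 1
        # Step 5: columns not in the learned model (e.g. field_D=yellow) are ignored.
        # Step 6: model columns missing from this row are filled with 0.
        X.append([encoded.get(c, 0) for c in model_columns])
    return X

# Columns assumed to have been learned by an earlier fit.
model_columns = ["field_B", "field_D=red", "field_D=green", "field_D=blue"]

rows = [
    {"field_A": "ok",    "field_B": 41, "field_D": "red"},
    {"field_A": "ok",    "field_B": 32, "field_D": "green"},
    {"field_A": "FRAUD", "field_B": 1,  "field_D": "blue"},
    {"field_A": None,    "field_B": 41, "field_D": "yellow"},  # value unseen at fit time
]
X = encode_for_apply(rows, model_columns, numeric_fields={"field_B"})
print(X)
```

The row with the unseen colour ends up with zeros in every field_D column, which is why the slides advise care with missing and novel values.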





apply: How It Works

… | apply fraud_model

4. Convert non-numeric fields to binary indicator variables.

ok         41         red
ok         32         green
FRAUD      1          blue
(null)     41         yellow

field_A    field_B    field_D=red    …=green    …=blue    …=yellow
ok         41         1              0          0         0
ok         32         0              1          0         0
FRAUD      1          0              0          1         0
(null)     41         0              0          0         1



apply: How It Works

5. Discard variables not in the learned model.

field_A    field_B    field_D=red    …=green    …=blue    …=yellow
ok         41         1              0          0         0
ok         32         0              1          0         0
FRAUD      1          0              0          1         0
(null)     41         0              0          0         1

apply: How It Works


y = [1, 1, 0, 1, ?]

X = [[41, 1, 0, 0],
     [32, 0, 1, 0],
     [1, 0, 0, 1],
     [41, 0, 0, 0]]

e.g. for Logistic Regression:

$$y = \frac{1}{1 + e^{-(\theta^T x)}}$$

Compute y using θ found by the fit command.

apply: How It Works


field_A    field_B    field_C    field_D    field_E        predicted(field_A)
ok         41         (null)     red        172.24.16.5    ok
ok         32         (null)     green      192.168.0.2    ok
FRAUD      1          (null)     blue       10.6.6.6       FRAUD
ok         43         (null)     (null)     171.64.72.1    ok
(null)     41         (null)     yellow     192.168.0.2    ok



Prediction

apply: Properties

▶ Learned models can be applied to new, unseen data.

| fit is to | apply as | outputlookup is to | lookup

▶ Resilient to missing values (but, again, be careful!).

▶ Automatically handles categorical (i.e. non-numeric) fields.

apply: Scalability

▶ No limits.

▶ When possible, executes at the Indexing tier.

• Fully parallelized; harness the CPU power of your Indexing Cluster.

• Must set “streaming_apply = true” in mlspl.conf.

ML-SPL Commands: summary

| summary <MODEL>

Examples:

| summary temp_model

| summary user_behavior_clusters

| summary petal_length_from_species

“Pipeline” Multiple Algorithms

▶ ML-SPL analytics are stackable.

▶ Very advanced ML use-cases are succinctly expressible.


Custom Machine Learning – Success Formula

▶ Domain Expertise (IT, Security, …)
• Identify use cases
• Drive decisions
• Set business/ops priorities

▶ Splunk Expertise
• SPL
• Data prep

▶ Data Science Expertise
• Statistics / math background
• Algorithm selection
• Model building

▶ Splunk ML Toolkit facilitates and simplifies via examples & guidance

▶ ITSI, UBA

▶ Operational success


Sense and Respond

Every Search Can Use Machine Learning

▶ OT, Industrial Assets, IT, Consumer and Mobile Devices

▶ Real Time Search Alert

▶ Send an email, File a ticket, Send a text, Flash lights, Trigger process flow

▶ Email, Tickets, Smartphones and Devices, Third-Party Applications


Data for Retail Transactions

What might we need to give insights?

▶ Point of Sale

▶ Loyalty Card

▶ Mobile Apps

▶ Marketing Push Messages


Splunk Demo


Key Takeaways

Operationalize Machine Learning with Splunk

1. Identify an action or intervention

2. Collect, enrich and prepare data

3. Apply Machine Learning

4. Generate alerts to drive actions

5. Keep history for continuous improvement


.conf2017: The 8th Annual Splunk Conference

SEPT 25-28, 2017 | Walter E. Washington Convention Center, Washington, D.C.

conf.splunk.com

SAVE OVER $450

You will receive an email after registration opens with a link to save over $450 on the full conference rate. You’ll have 30 days to take advantage of this special promotional rate!


Take the Survey on Pony Poll

ponypoll.com/london17


THANK YOU

