© 2017 SPLUNK INC.
Feeding Hungry People With Machine Learning
Duncan Turnbull | Technical Lead, Business Analytics & IoT, EMEA
11TH MAY 2017 | LONDON
Run the Business in Real-time
Data From the Past (T – a few days) | Real-time Data | Statistical Forecast (T + a few days)
Descriptive (BI Tools, Data Lakes) | Grey space | Predictive (Models)
Security Operations Center
IT Operations Center
Business Operations Center
Overview of ML at Splunk
Core Platform Search
Packaged Premium Solutions
Custom ML
Platform for Operational Intelligence
Machine Learning is Not Magic
▶ … it’s a process.
▶ Data Preparation is about 80%*
Collect Data → Clean/Transform → Explore/Visualize → Model → Evaluate → Publish/Deploy
* “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”, Forbes Mar 23, 2016
Splunk is a playground for Data Preparation
Collect Data → Clean/Transform → Explore/Visualize → Model → Evaluate → Publish/Deploy
Splunk features used along the way:
• Add-ons from Splunkbase
• props.conf, transforms.conf, Datamodels
• Pivot, Table UI, SPL
• ML Toolkit
• Alerts, Dashboards, Reports
ML-SPL Commands: A “grammar” for ML
Fit (i.e. train) a model from search results
… | fit <ALGORITHM> <TARGET> from <VARIABLES …> <PARAMETERS> into <MODEL>
Apply a model to obtain predictions from (new) search results
… | apply <MODEL>
Inspect the model inferred by <ALGORITHM> (e.g. display coefficients)
| summary <MODEL>
fit: How It Works
1. Discard fields that are null for all search results.
2. Discard non-numeric fields with >100 distinct values.
3. Discard search results with any null fields.
4. Convert non-numeric fields to binary indicator variables (i.e. “dummy coding”).
5. Convert to a numeric matrix and hand over to <ALGORITHM>.
6. Compute predictions for all search results.
7. Save the learned model.
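Steps 1 to 3 above can be sketched in plain Python. This is an illustrative sketch only, not the toolkit's actual code: the function names `is_numeric` and `prepare_for_fit` are hypothetical, and the distinct-value threshold is lowered from 100 to 3 so that the five-row example from the following slides triggers step 2.

```python
# Illustrative sketch of fit steps 1-3; names are hypothetical.
MAX_DISTINCT = 100  # threshold named in step 2


def is_numeric(value):
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def prepare_for_fit(rows, fields, max_distinct=MAX_DISTINCT):
    """rows is a list of dicts; a value of None marks a null field."""
    # Step 1: discard fields that are null for all search results.
    kept = [f for f in fields if any(r.get(f) is not None for r in rows)]

    # Step 2: discard non-numeric fields with more than max_distinct values.
    def too_many_categories(field):
        values = {r[field] for r in rows if r.get(field) is not None}
        non_numeric = not all(is_numeric(v) for v in values)
        return non_numeric and len(values) > max_distinct

    kept = [f for f in kept if not too_many_categories(f)]

    # Step 3: discard search results with a null in any remaining field.
    complete = [
        {f: r[f] for f in kept}
        for r in rows
        if all(r.get(f) is not None for f in kept)
    ]
    return kept, complete


fields = ["field_A", "field_B", "field_C", "field_D", "field_E"]
rows = [
    {"field_A": "ok", "field_B": 41, "field_C": None, "field_D": "red", "field_E": "172.24.16.5"},
    {"field_A": "ok", "field_B": 32, "field_C": None, "field_D": "green", "field_E": "192.168.0.2"},
    {"field_A": "FRAUD", "field_B": 1, "field_C": None, "field_D": "blue", "field_E": "10.6.6.6"},
    {"field_A": "ok", "field_B": 43, "field_C": None, "field_D": None, "field_E": "171.64.72.1"},
    {"field_A": None, "field_B": 2, "field_C": None, "field_D": "blue", "field_E": "192.168.0.2"},
]

# Threshold lowered to 3 for this toy data so field_E (4 distinct values) is dropped.
kept, complete = prepare_for_fit(rows, fields, max_distinct=3)
print(kept)
print(len(complete))
```

With the lowered threshold, field_C and field_E are dropped and only the three fully populated rows remain, which matches the tables shown on the next slides.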
fit: How It Works
1. Discard fields that are null for all search results.
field_A  field_B  field_C  field_D  field_E
ok       41       (null)   red      172.24.16.5
ok       32       (null)   green    192.168.0.2
FRAUD    1        (null)   blue     10.6.6.6
ok       43       (null)   (null)   171.64.72.1
(null)   2        (null)   blue     192.168.0.2
Target: field_A. Explanatory variables: field_B through field_E.
… | fit LogisticRegression field_A from field_*
fit: How It Works
2. Discard non-numeric fields with >100 distinct values.
field_A  field_B  field_D  field_E
ok       41       red      172.24.16.5
ok       32       green    192.168.0.2
FRAUD    1        blue     10.6.6.6
ok       43       (null)   171.64.72.1
(null)   2        blue     192.168.0.2
Target: field_A. Explanatory variables: field_B, field_D, field_E.
… | fit LogisticRegression field_A from field_*
fit: How It Works
3. Discard search results with any null fields.
field_A  field_B  field_D
ok       41       red
ok       32       green
FRAUD    1        blue
ok       43       (null)
(null)   2        blue
Target: field_A. Explanatory variables: field_B, field_D.
… | fit LogisticRegression field_A from field_*
fit: How It Works
4. Convert non-numeric fields to binary indicator variables.
… | fit LogisticRegression field_A from field_*
Target: field_A. Explanatory variables: field_B, field_D.
Before:
field_A  field_B  field_D
ok       41       red
ok       32       green
FRAUD    1        blue
After:
field_A  field_B  field_D=red  field_D=green  field_D=blue
ok       41       1            0              0
ok       32       0            1              0
FRAUD    1        0            0              1
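The dummy-coding step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the toolkit's implementation: the helper name `dummy_code` is hypothetical, and categories are ordered by first appearance.

```python
def dummy_code(rows, field):
    """Replace a categorical field with one 0/1 indicator column per value."""
    categories = []
    for r in rows:  # keep categories in first-seen order
        if r[field] not in categories:
            categories.append(r[field])

    coded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k != field}
        for c in categories:
            out[f"{field}={c}"] = 1 if r[field] == c else 0
        coded.append(out)
    return coded


rows = [
    {"field_A": "ok", "field_B": 41, "field_D": "red"},
    {"field_A": "ok", "field_B": 32, "field_D": "green"},
    {"field_A": "FRAUD", "field_B": 1, "field_D": "blue"},
]
coded = dummy_code(rows, "field_D")

# Explanatory variables as a numeric matrix, one row per search result.
columns = ["field_B", "field_D=red", "field_D=green", "field_D=blue"]
X = [[r[c] for c in columns] for r in coded]
print(X)
```

Each distinct value of field_D becomes its own column, which is why high-cardinality fields are discarded beforehand.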
fit: How It Works
5. Convert to a numeric matrix and hand over to <ALGORITHM>.
… | fit LogisticRegression field_A from field_*
y = [1, 1, 0]
X = [[41, 1, 0, 0],
     [32, 0, 1, 0],
     [1, 0, 0, 1]]
e.g. for Logistic Regression:
$$y = \frac{1}{1 + e^{-\theta^{T} x}}$$
Find $\theta$ using maximum likelihood estimation.
Model inference generally delegated to scikit-learn and statsmodels.
(e.g. sklearn.linear_model.LogisticRegression)
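To make "find θ using maximum likelihood estimation" concrete, here is a minimal sketch that maximizes the logistic log-likelihood with plain gradient ascent. It is a swapped-in teaching technique, not the solver scikit-learn uses, and the toy data, learning rate and step count are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(theta, X, y):
    """Sum of y*log(p) + (1-y)*log(1-p) over all rows."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total


def fit_logistic(X, y, learning_rate=0.1, steps=500):
    """Gradient ascent on the (averaged) log-likelihood."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        # Gradient of the log-likelihood: sum over rows of (y_i - p_i) * x_i.
        grad = [0.0] * len(theta)
        for x_i, y_i in zip(X, y):
            p = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
            for j, v in enumerate(x_i):
                grad[j] += (y_i - p) * v
        theta = [t + learning_rate * g / len(X) for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data: first column is a constant 1 (intercept), second is one feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]

theta = fit_logistic(X, y)
print(theta)
print(log_likelihood([0.0, 0.0], X, y), log_likelihood(theta, X, y))
```

The toy labels are deliberately not perfectly separable, so the likelihood has a finite maximum and the fitted slope stays bounded.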
fit: How It Works
6. Compute predictions for all search results.
field_A  field_B  field_C  field_D  field_E      predicted(field_A)
ok       41       (null)   red      172.24.16.5  ok
ok       32       (null)   green    192.168.0.2  ok
FRAUD    1        (null)   blue     10.6.6.6     FRAUD
ok       43       (null)   (null)   171.64.72.1  ok
(null)   2        (null)   blue     192.168.0.2  FRAUD
Target: field_A. Explanatory variables: field_B through field_E. Prediction: predicted(field_A).
… | fit LogisticRegression field_A from field_*
fit: How It Works
7. Save the learned model.
Serialize model settings, coefficients, etc. into a Splunk lookup table.
• Replicated amongst members of Search Head Cluster.
• Automatically distributed to Indexers with search bundle.
… | fit LogisticRegression field_A from field_* into logreg_model
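The idea of storing a model as a table row can be sketched as follows. The column layout (`name`, `algorithm`, `payload`), the helper names and the coefficient values are all hypothetical; the slides do not describe the toolkit's real on-disk format, so treat this only as an illustration of "serialize settings and coefficients into a lookup-style table".

```python
import csv
import io
import json


def save_model(name, algorithm, feature_names, coefficients):
    """Serialize a model into CSV text, one row per model (hypothetical layout)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "algorithm", "payload"])
    writer.writeheader()
    payload = json.dumps({"features": feature_names, "coefficients": coefficients})
    writer.writerow({"name": name, "algorithm": algorithm, "payload": payload})
    return buffer.getvalue()


def load_model(csv_text):
    """Read the first model row back from CSV text."""
    row = next(csv.DictReader(io.StringIO(csv_text)))
    payload = json.loads(row["payload"])
    return row["name"], row["algorithm"], payload["features"], payload["coefficients"]


features = ["field_B", "field_D=red", "field_D=green", "field_D=blue"]
coefficients = [0.5, -1.25, 0.0, 2.0]  # made-up values for illustration

csv_text = save_model("logreg_model", "LogisticRegression", features, coefficients)
loaded = load_model(csv_text)
print(loaded)
```

Keeping the feature names next to the coefficients matters later: apply needs them to line up new data with the columns the model was trained on.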
fit: Scalability
▶ Some algorithms support incremental fitting, e.g. SGDRegressor, SGDClassifier, NaiveBayes
• Use “partial_fit=t” option with fit command.
• No sampling, no event limit!
▶ Some algorithms are inherently not scalable.
• e.g. kernel-based Support Vector Machines are $O(N^3)$.
▶ Input is down-sampled using reservoir sampling.
• Per-algorithm sample reservoir size, typically 100,000 events
• Configurable in mlspl.conf.
▶ For the most part, you don’t need to care.
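Reservoir sampling, mentioned above, keeps a fixed-size uniform random sample from a stream whose length is not known in advance. The sketch below is the classic single-pass variant (often called Algorithm R); whether the toolkit uses exactly this variant is an assumption, and the function name and seed parameter are hypothetical.

```python
import random


def reservoir_sample(events, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, event in enumerate(events):
        if i < k:
            # Fill the reservoir with the first k events.
            reservoir.append(event)
        else:
            # Keep the new event with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = event
    return reservoir


sample = reservoir_sample(range(10_000), k=100, seed=42)
small = reservoir_sample(range(5), k=10)
print(len(sample), small)
```

Memory use depends only on the reservoir size, not on the number of events, which is what lets fit accept arbitrarily large searches for algorithms that cannot train incrementally.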
ML-SPL Commands: apply
… | apply <MODEL>
Examples:
… | apply temp_model
… | apply user_behavior_clusters
… | apply petal_length_from_species
apply: How It Works
1. Load the learned model.
2. Discard fields that are null for all search results.
3. Discard non-numeric fields with >100 distinct values.
4. Convert non-numeric fields to binary indicator variables (i.e. “dummy coding”).
5. Discard variables not in the learned model.
6. Fill missing fields with 0’s.
7. Convert to a numeric matrix and hand over to <ALGORITHM>.
8. Compute predictions for all search results.
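Steps 5 and 6 amount to aligning new data with the columns the model was trained on. A minimal sketch, assuming the learned model remembers its column names in order (the helper name `align_to_model` is hypothetical):

```python
def align_to_model(coded_rows, model_columns):
    """Build matrix rows in the model's column order.

    Columns the model never saw are dropped (step 5); columns the model
    expects but a row lacks, or holds as null, become 0 (step 6).
    """
    matrix = []
    for row in coded_rows:
        values = []
        for col in model_columns:
            value = row.get(col)
            values.append(0 if value is None else value)
        matrix.append(values)
    return matrix


# Columns learned during fit (from the dummy-coded training data).
model_columns = ["field_B", "field_D=red", "field_D=green", "field_D=blue"]

# New, already dummy-coded search results; "yellow" was never seen by fit.
coded_rows = [
    {"field_B": 41, "field_D=red": 1},
    {"field_B": 41, "field_D=yellow": 1},
]
X = align_to_model(coded_rows, model_columns)
print(X)
```

The unseen category ends up as an all-zero indicator block, so the model still produces a prediction, but one based only on the remaining fields.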
apply: How It Works
4. Convert non-numeric fields to binary indicator variables.
… | apply fraud_model
Target: field_A. Explanatory variables: field_B, field_D.
Before:
field_A  field_B  field_D
ok       41       red
ok       32       green
FRAUD    1        blue
(null)   41       yellow
After:
field_A  field_B  field_D=red  field_D=green  field_D=blue  field_D=yellow
ok       41       1            0              0             0
ok       32       0            1              0             0
FRAUD    1        0            0              1             0
(null)   41       0            0              0             1
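apply: How It Works
5. Discard variables not in the learned model.
… | apply fraud_model
Target: field_A. Explanatory variables: field_B, field_D.
field_A  field_B  field_D=red  field_D=green  field_D=blue  field_D=yellow
ok       41       1            0              0             0
ok       32       0            1              0             0
FRAUD    1        0            0              1             0
(null)   41       0            0              0             1
The value yellow was not present when the model was fit, so field_D=yellow is not in the learned model and is discarded.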
apply: How It Works
7. Convert to a numeric matrix and hand over to <ALGORITHM>.
… | apply fraud_model
y = [1, 1, 0, ?]
X = [[41, 1, 0, 0],
     [32, 0, 1, 0],
     [1, 0, 0, 1],
     [41, 0, 0, 0]]
e.g. for Logistic Regression:
$$y = \frac{1}{1 + e^{-\theta^{T} x}}$$
Compute $y$ using $\theta$ found by the fit command.
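Computing predictions from an already learned θ is just the logistic formula applied row by row. In the sketch below the θ values are made up for illustration (they are not from the slides), there is no intercept term, and y = 1 is taken to mean "ok", following the encoding used in the slide's y vector.

```python
import math


def predict_proba(theta, x):
    """Logistic function applied to the dot product of theta and x."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for field_B, field_D=red, field_D=green, field_D=blue.
theta = [0.1, 0.5, 0.2, -3.0]

X = [
    [41, 1, 0, 0],
    [32, 0, 1, 0],
    [1, 0, 0, 1],
    [41, 0, 0, 0],  # the "yellow" row: all field_D indicators are 0
]

probabilities = [predict_proba(theta, x) for x in X]
labels = ["ok" if p >= 0.5 else "FRAUD" for p in probabilities]
print(labels)
```

No training happens here: apply only evaluates the stored model, which is why it can run on every search result, including rows with values the model has never seen.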
apply: How It Works
8. Compute predictions for all search results.
field_A  field_B  field_C  field_D  field_E      predicted(field_A)
ok       41       (null)   red      172.24.16.5  ok
ok       32       (null)   green    192.168.0.2  ok
FRAUD    1        (null)   blue     10.6.6.6     FRAUD
ok       43       (null)   (null)   171.64.72.1  ok
(null)   41       (null)   yellow   192.168.0.2  ok
Target: field_A. Explanatory variables: field_B through field_E. Prediction: predicted(field_A).
… | apply fraud_model
apply: Properties
▶ Learned models can be applied to new, unseen data.
| fit is to | apply as | outputlookup is to | lookup
▶ Resilient to missing values (but, again, be careful!).
▶ Automatically handles categorical (i.e. non-numeric) fields.
apply: Scalability
▶ No limits.
▶ When possible, executes at the Indexing tier.
• Fully parallelized; harness the CPU power of your Indexing Cluster.
• Must set “streaming_apply = true” in mlspl.conf.
ML-SPL Commands: summary
| summary <MODEL>
Examples:
| summary temp_model
| summary user_behavior_clusters
| summary petal_length_from_species
“Pipeline” Multiple Algorithms
▶ ML-SPL analytics are stackable.
▶ Very advanced ML use-cases are succinctly expressible.
Custom Machine Learning – Success Formula
▶ Domain Expertise (IT, Security, …)
• Identify use cases
• Drive decisions
• Set business/ops priorities
▶ Splunk Expertise
• SPL
• Data prep
▶ Data Science Expertise
• Statistics / math background
• Algorithm selection
• Model building
▶ Splunk ML Toolkit facilitates and simplifies via examples & guidance
▶ ITSI, UBA
▶ Operational success
Sense and Respond
Every Search Can Use Machine Learning
▶ Sources: OT; Industrial Assets; IT; Consumer and Mobile Devices
▶ Real Time Search; Alert
▶ Targets: Third-Party Applications; Smartphones and Devices; Tickets
▶ Actions: Send an; File a ticket; Send a text; Flash lights; Trigger process flow
Data for Retail Transactions
What might we need to give insights?
▶ Point of Sale
▶ Loyalty Card
▶ Mobile Apps
▶ Marketing Push Messages
Operationalize Machine Learning with Splunk
Key Takeaways
1. Identify an action or intervention
2. Collect, enrich and prepare data
3. Apply Machine Learning
4. Generate alerts to drive actions
5. Keep history for continuous improvement
.conf2017 | The 8th Annual Splunk Conference
SEPT 25-28, 2017 | Walter E. Washington Convention Center, Washington, D.C.
conf.splunk.com
SAVE OVER $450
You will receive an email after registration opens with a link to save over $450 on the full conference rate. You’ll have 30 days to take advantage of this special promotional rate!