7. 7
What is “Learning”?
[Prediction]
• When we see thick clouds and an overcast sky, we
predict that it’s likely going to rain
[Estimation/ Regression]
• Estimate how much an apartment costs based on its
location, condition and prices of properties in that
neighborhood
[Classification/ Clustering]
• Determine the gender of a person based on her/his
features, hair style and the way s/he dresses
[Anomaly Detection]
• Identify the odd one out
[Reinforcement Learning]
• If I made a mistake this time, can I do better next time?
All of us have had some
experience in learning.
But… what’s behind our
experience? How do we
translate that knowledge
to code?
17. 17
What’s New in 2.0?
1
7
• New name and abbreviation
• No event limits (removal of 50K
limit on fitting models)
• Configurable resource caps via
mlspl.conf
• Search head clustering support
• Distributed / streaming apply
• Scheduled fit
• New algorithms (see Slide 7)
– Feature engineering and selection
– Stochastic gradient descent (e.g.)
– ARIMA
• Multi-algorithm support across
Assistants
• Scatterplot matrix viz
• Alerting
• Tooltips
• In-app tours
• Cluster Numeric Events assistant
25. 25
Fit, Apply & Validate Models
ML SPL – New grammar for doing ML in Splunk
fit – fit models based on training data
– [training data] | fit LinearRegression costly_KPI from
feature1 feature2 feature3 into my_model
apply – apply models on testing and production data
– [testing/production data] | apply my_model
Validate Your Model (The Hard Part)
– Why hard? Because statistics is hard! Also: model error ≠ real world risk.
– Analyze residuals, mean-square error, goodness of fit, cross-validate, etc.
– Take Splunk’s Analytics & Data Science Education course
26. 26
ML Commands in Core SPL
Cluster – groups events together
based on how textually similar they are
to each other.
Anomalies – finds events or field
values that are unusual or unexpected
Predict - forecasts values for one or
more sets of time-series data
Kmeans - Kmeans clustering on events.
Anomalousvalue - anomaly score for
each field of each event, relative to the
values of this field across other events.
Anomalydetection - identifies
anomalous events by computing a
probability for each event and then
detecting unusually small probabilities.
• X11 - exposes seasonal trend in your
time series.
Associate – Change in entropy
between two fields.
Findkeywords - Given a set of
numbered groups (from say cluster)
calculates the common words found in each
cluster.
Analyzefields - What is the ability
of a set of fields to predict a single
field. Univariate analysis.
And lots more
Reference Docs -> http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/ListOfSearchCommands
27. 27
Don’t Forget: 80% of Data Science is Data Munging
Trendline – Moving Averages of fields
Erex- Use the erex command to extract
data from a field when you do not know
the regular expression to use
Correlation – Co-occurence NOT
correlation as per Pearson et., don’t mix
this up.
Autoregress – Copies one or more
previous values for a field into an
event.
Contingency - Frequency distribution
matrix.
Cofilter - find how many times field1
and field2 values occurred together.
Reference Docs -> http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/ListOfSearchCommands
Stats, Eventstats, Streamstats,
Timechart, Chart – stats reporting
Eval - evaluate new fields
And lots more
28. 28
What Now?
2
8
• Get the Machine Learning Toolkit from Splunkbase
• Go watch Machine Learning Videos on Splunk Youtube Channel http://tiny.cc/splunkmlvideos
• Go to Machine Learnings talks: https://conf.splunk.com/
– Advanced Machine Learning in SPL with the Machine Learning Toolkit by Jacob Leverich
– Extending SPL with Custom Search Commands and the Splunk SDK for Python by Jacob Leverich
• Several Customers and Partner Talks
– Cisco, Scianta Analytics, Asian Telco, etc.
• Early Adopter And Customer Advisory Program : mlprogram@splunk.com
• Product Manager: Manish Sainani ms@splunk.com
• Field Expert: Andrew Stein astein@splunk.com
http://tiny.cc/splunkmlapp