The Challenges of Bringing Machine Learning to the Masses

Alice Zheng and Sethu Raman
GraphLab Inc.
NIPS workshop on Software Engineering for Machine Learning
December 13, 2014

Self introduction
ML Research
“Accessible ML”

The need for accessible ML
• So much potential in ML
• Everyone trying to make sense of their data
• ML is transforming lives and industries:
personalized medicine, internet search, social
networks, advertising, etc.
• But success is unattainable to most

Building a predictive app
• Was using 217 business rules, hoping the world doesn't change
• Have an inspiring idea to reinvent their business
Key pains:
• Hiring talent: shortfall in data-savvy workers needed to make sense of big data by 2018 [McKinsey 2011]: 35%
• Noisy space of tools: "Data scientists use a variety of tools, across different programming languages… require a lot of context-switching… affects productivity and impedes reproducibility." (Ben Lorica, "Data Analysis: Just one component of the Data Science workflow")

Building a predictive app
[Diagram: data, feature engineering, model definition, training, evaluation, deployment, monitoring]

Pure ML is not enough
• Building a predictive application involves much
more than just building ML models
• System engineering: data storage, computation
infrastructure, networking…
• Data Science: problem definition, data cleaning,
feature engineering
• Software development: turn prototype model into
bullet-proof production code
• Operations engineering: deploy and monitor app
• …

Pain points
• What are the right features?
• What model should I use?
• How do I train it?
• How do I set the tuning parameters?
• Do I even have the right data?
• Ok, I have a working prototype, now what?

Pain points
• Increase in data size or decrease in
latency requires complete rewrite of code
and new toolset
• GB: R/scikit-learn/Matlab
• TB to PB: Hadoop/Mahout/Spark
• Many forms of data and data structures
• Images, text, speech, logs
• Dense lists, sparse dictionaries, time series
• Tables, graphs, matrices, tensors
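To make the dense versus sparse distinction concrete, here is a minimal sketch in plain Python. The helper names `dense_to_sparse` and `sparse_to_dense` are hypothetical and not part of any library mentioned in this talk; they only illustrate why the same data may need to move between representations.

```python
def dense_to_sparse(values, default=0):
    """Keep only entries that differ from the default, keyed by position."""
    return {i: v for i, v in enumerate(values) if v != default}


def sparse_to_dense(entries, length, default=0):
    """Expand a {position: value} dictionary back into a full list."""
    return [entries.get(i, default) for i in range(length)]


dense = [0, 0, 3, 0, 5]
sparse = dense_to_sparse(dense)
print(sparse)                                # {2: 3, 4: 5}
print(sparse_to_dense(sparse, len(dense)))   # [0, 0, 3, 0, 5]
```

The sparse form saves space when most entries equal the default, but tools that expect one form usually cannot consume the other without a conversion step like this.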

The need for an ML platform
• Minimize tool/code switching, maximize
performance (speed/accuracy/scale)
• Graceful transition from small to large
dataset sizes
• Flexible, interoperable data types
• Minimize complexity
• System-agnostic
• Simple API
• Auto-tune parameters
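One simple way to read "auto-tune parameters" is a search over candidate values scored on held-out data. The sketch below is an assumption about what such a feature could look like, not a description of any platform's implementation: it fits a one-dimensional ridge regression (no intercept) for each candidate regularization strength and keeps the one with the lowest validation error. All function names and the toy data are made up for illustration.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def auto_tune(train, valid, candidates):
    """Return (best_lam, best_error), scored on the validation set."""
    best = None
    for lam in candidates:
        w = fit_ridge_1d(train[0], train[1], lam)
        err = mse(w, valid[0], valid[1])
        if best is None or err < best[1]:
            best = (lam, err)
    return best


# Toy data: (inputs, targets) pairs for training and validation.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
valid = ([5.0, 6.0], [10.1, 11.9])
candidates = [0.0, 0.1, 1.0, 10.0]
best_lam, best_err = auto_tune(train, valid, candidates)
print(best_lam, best_err)
```

The user supplies only the data and a list of candidates; the search itself is hidden behind one call, which is the kind of simplification the slide asks for.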

The parallel to databases
• What’s an example of a mega-successful
platform for data operations?
• Databases!
• SQL, Oracle, NoSQL, …
• What lessons can we bring in from the
database world?

Database engine components
[Diagram: storage, storage engine, query execution, query optimizer]

[Diagram repeated: storage, storage engine, query execution, query optimizer]
Complex but self-contained, has a clean API, only changes when there's new hardware.

[Diagram repeated: storage, storage engine, query execution, query optimizer]
Complex bag of tricks, no formalism, constantly changing to adapt to data, query, and disk characteristics.

ML engine components
[Diagram: data, feature engineering, model definition, training, evaluation]
Bags of tricks, expert knowledge, experience, lots of trial and error

Advances in databases
• Reasonable abstraction—relational DB
• Hardware speedups
• Pragmatic software implementation
Successful platform
• Take-away lesson: fast computation
engine + “good enough” execution plan

To advance ML platforms
• ML will be end-user friendly when the platform is clever enough to handle less-than-optimal directions from the user
• What needs to happen?
• The complexity needs to be automated and
wrapped away with neat interfaces between
components
• Fast components, “good enough” directions

GraphLab
• Started as a research project at CMU in
2009
• Now a Seattle-based startup

The GraphLab Create™ Solution
• Flexible, interoperable data types
• SArray+SFrame+SGraph inter-translatable
• dense list, sparse array, image, text, tables, graphs
• Graceful transition between data sizes
• SFrame: memory to disk to distributed
• One environment, many substrates
• Python front-end
• Localhost, cluster, Hadoop, EC2
• End-to-end
• Data ingestion+feature engineering+model building+
deployment in a single environment
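The "memory to disk" idea can be illustrated without any GraphLab API. The following sketch, which uses only the Python standard library, averages one column of a CSV file while holding a single row in memory at a time. It is an assumption-laden stand-in for out-of-core processing in general and does not show how SFrame works internally; `streaming_mean` and the sample columns are hypothetical.

```python
import csv
import os
import tempfile


def streaming_mean(path, column):
    """Average one CSV column while reading the file row by row."""
    total, count = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")


# Write a small file to stand in for a dataset too large for memory.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user", "rating"])
    writer.writerows([["a", 4], ["b", 2], ["c", 3]])
    path = f.name

result = streaming_mean(path, "rating")
os.remove(path)
print(result)  # 3.0
```

The same function works unchanged whether the file holds three rows or three billion, which is the property a "graceful transition between data sizes" is after.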

GraphLab Create ML Toolkits
• Business Task (for developers): Recommender, Target, Social Match, …
• Machine Learning Task (for savvy developers and data scientists): Regression, Classification, Data Matching, …
• Algorithms & SDK (for ML experts): SVM, Matrix Factorization, LDA, …

GLC SDK example
• Task: fill in missing value in an array using
previous value
• Existing solution:
• E.g., use Pandas, a Python library providing in-memory dataframes
• Problem:
• Given, say, 25M rows and 50 cols, takes
forever to even load the data
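The task itself is small. A minimal pure-Python sketch of forward fill follows; `forward_fill` is a hypothetical helper written for illustration, not a Pandas or GraphLab function. It shows the logic only and makes no claim about performance at 25M rows.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled = []
    last_value = None  # stays None until the first real value is seen
    for v in values:
        if v is not None:
            last_value = v
        filled.append(last_value)
    return filled


print(forward_fill([1, 2, 3, None, 6]))  # [1, 2, 3, 3, 6]
```

Leading missing values have nothing to copy from, so this sketch leaves them as None.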

GLC SDK solution
> cat fill.cpp
#include <flexible_type/flexible_type.hpp>
#include <unity/lib/toolkit_function_macros.hpp>
#include <unity/lib/gl_sarray.hpp>
using namespace graphlab;
// Forward fill: replace each missing value with the last non-missing one.
gl_sarray fill(gl_sarray sa) {
  gl_sarray_writer writer(sa.dtype(), 1);  // one output segment
  flexible_type last_value = sa[0];
  for (const auto& elem : sa.range_iterator()) {
    if (elem != FLEX_UNDEFINED)
      last_value = elem;
    writer.write(last_value, 0);  // write to segment 0
  }
  return writer.close();
}
// Register fill() so it can be called from Python.
BEGIN_FUNCTION_REGISTRATION
REGISTER_FUNCTION(fill, "sa");
END_FUNCTION_REGISTRATION

GLC SDK solution
> cat Makefile
all: fill.so
fill.so: fill.cpp
	g++ -std=c++11 $^ -I graphlab -I ~/graphlab-dev/deps -shared -fPIC -o $@ -O3
> python
>>> import graphlab as gl
>>> gl.ext_import('fill.so', 'example')
>>> sa = gl.SArray([1, 2, 3, None, 6])
>>> print gl.extensions.example.fill.fill(sa)
[1, 2, 3, 3, 6]

Join the revolution!
• Research methods to make the following
efficient and automatic:
• Feature engineering
• Model selection
• Model debugging
• Problem formulation (??)
• Develop novel algorithms on top of our SDK
• Backed by scalable, flexible typed data structures
• Automatic Python wrappers
• Make them available to many other people
• We’re hiring! jobs@graphlab.com
