The Challenges of Bringing Machine Learning to the Masses
Alice Zheng and Sethu Raman
GraphLab Inc.
NIPS workshop on Software Engineering for Machine Learning
December 13, 2014
Self introduction
ML Research
“Accessible ML”
The need for accessible ML
• So much potential in ML
• Everyone trying to make sense of their data
• ML is transforming lives and industries:
personalized medicine, internet search, social
networks, advertising, etc.
• But success is unattainable to most
Building a predictive app
Was using 217 business rules
hoping world doesn’t change
Have an inspiring idea to
reinvent their business
Key pains:
Hiring Talent
Shortfall in data-savvy workers
needed to make sense out of
big data by 2018 [McKinsey 2011]
35%
Noisy Space of Tools
Data scientists use a variety of tools, across
different programming languages…
require a lot of context-switching…
affects productivity and impedes reproducibility.
Ben Lorica,
Data Analysis: Just one component of
the Data Science workflow
Building a predictive app
Feature
engineering
Model
definition
Training
evaluation
Data
Deployment
Monitoring
Pure ML is not enough
• Building a predictive application involves much
more than just building ML models
• System engineering: data storage, computation
infrastructure, networking…
• Data Science: problem definition, data cleaning,
feature engineering
• Software development: turn prototype model into
bullet-proof production code
• Operations engineering: deploy and monitor app
• …
Pain points
• What are the right features?
• What model should I use?
• How do I train it?
• How do I set the tuning parameters?
• Do I even have the right data?
• Ok, I have a working prototype, now what?
Pain points
• Increase in data size or decrease in
latency requires complete rewrite of code
and new toolset
• GB: R/scikit-learn/Matlab
• TB-PB: Hadoop/Mahout/Spark
• Many forms of data and data structures
• Images, text, speech, logs
• Dense lists, sparse dictionaries, time series
• Tables, graphs, matrices, tensors
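To make the "dense lists, sparse dictionaries" point concrete, here is a minimal Python sketch of moving the same values between the two representations. The helper names `dense_to_sparse` and `sparse_to_dense` are illustrative only and do not come from any of the tools named above.

```python
def dense_to_sparse(dense, zero=0):
    """Keep only the non-zero entries, keyed by their position."""
    return {index: value for index, value in enumerate(dense) if value != zero}


def sparse_to_dense(sparse, length, zero=0):
    """Rebuild a fixed-length list, filling unspecified positions with `zero`."""
    dense = [zero] * length
    for index, value in sparse.items():
        dense[index] = value
    return dense


dense_row = [0, 0, 3, 0, 5]
sparse_row = dense_to_sparse(dense_row)
print(sparse_row)                                    # {2: 3, 4: 5}
print(sparse_to_dense(sparse_row, len(dense_row)))   # [0, 0, 3, 0, 5]
```

Every tool that accepts only one of these forms forces a conversion step like this one, which is part of the switching cost the slide describes.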
The need for an ML platform
• Minimize tool/code switching, maximize
performance (speed/accuracy/scale)
• Graceful transition from small to large
dataset sizes
• Flexible, interoperable data types
• Minimize complexity
• System-agnostic
• Simple API
• Auto-tune parameters
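One simple reading of "auto-tune parameters" is a search over candidate settings scored on held-out data. The sketch below is a hedged illustration in plain Python, not how any particular platform does it: the 1-D ridge model, the toy data, and the names `fit_ridge_1d`, `mse`, and `auto_tune` are all assumptions made for this example.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept:
    w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the prediction w*x against y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def auto_tune(train, valid, candidates):
    """Fit once per candidate penalty and keep the one with the lowest
    error on the validation split."""
    train_x, train_y = train
    valid_x, valid_y = valid
    scored = []
    for lam in candidates:
        w = fit_ridge_1d(train_x, train_y, lam)
        scored.append((mse(w, valid_x, valid_y), lam, w))
    best_error, best_lam, best_w = min(scored)
    return best_lam, best_w, best_error


# Toy data, made up for illustration.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
valid = ([5.0, 6.0], [10.1, 11.9])
candidates = [0.0, 0.1, 1.0, 10.0]

best_lam, best_w, best_error = auto_tune(train, valid, candidates)
print(best_lam, best_w, best_error)
```

The user supplies only data and a list of candidates; the choice of the tuning parameter is made by the search rather than by hand.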
The parallel to databases
• What’s an example of a mega-successful
platform for data operations?
• Databases!
• SQL, Oracle, NoSQL, …
• What lessons can we bring in from the
database world?
Database engine components
Storage
engine
Query
execution
Query
optimizer
Storage
Database engine components
Storage
engine
Query
execution
Query
optimizer
Storage
Complex but self-contained, has clean API,
only changes when there’s new hardware.
Database engine components
Storage
engine
Query
execution
Query
optimizer
Storage
Complex bag of tricks, no formalism,
constantly changing to adapt to
data, query, disk characteristics.
ML engine components
Feature
engineering
Model
definition
Training
evaluation
Data
Bags of tricks,
expert knowledge,
experience,
lots of trial and error
Advances in databases
• Reasonable abstraction—relational DB
• Hardware speedups
• Pragmatic software implementation
Successful platform
• Take-away lesson: fast computation
engine + “good enough” execution plan
To advance ML platforms
• ML will be end-user friendly when the
platform is clever enough to handle
less-than-optimal directions from the user
• What needs to happen?
• The complexity needs to be automated and
wrapped away with neat interfaces between
components
• Fast components, “good enough” directions
GraphLab
• Started as a research project at CMU in
2009
• Now a Seattle-based startup
The GraphLab Create™ Solution
• Flexible, interoperable data types
• SArray+SFrame+SGraph inter-translatable
• dense list, sparse array, image, text, tables, graphs
• Graceful transition between data sizes
• SFrame: memory to disk to distributed
• One environment, many substrates
• Python front-end
• Localhost, cluster, Hadoop, EC2
• End-to-end
• Data ingestion+feature engineering+model building+
deployment in a single environment
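The "memory to disk" idea can be illustrated without any GraphLab code: if a computation reads its input one row at a time, memory use stays flat as the file grows. The sketch below uses only the Python standard library. It is not how SFrame is implemented, and `streaming_mean` and the demo file are assumptions made for this example.

```python
import csv
import os
import tempfile


def streaming_mean(path, column):
    """Mean of one CSV column, read one row at a time so memory use
    does not grow with the size of the file. Empty cells are skipped."""
    total = 0.0
    count = 0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            cell = row[column]
            if cell == "":
                continue
            total += float(cell)
            count += 1
    return total / count if count else None


# Tiny demo file; the same function works unchanged on a much larger one.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["user", "rating"])
        writer.writerows([["a", 4], ["b", ""], ["c", 2]])
    demo_mean = streaming_mean(demo_path, "rating")

print(demo_mean)  # 3.0
```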
GraphLab Create ML Toolkits
Machine Learning Task
Business
Task
Algorithms & SDK
Recommender, Target, Social
Match, …
Regression, Classification,
Data Matching,…
SVM, Matrix
Factorization, LDA, …
Developers
Savvy Dev
& Data Sci.
ML
experts
Demos
GLC SDK example
• Task: fill in missing values in an array using the
previous value
• Existing solution:
• E.g., use Pandas, a Python library providing
in-memory dataframes
• Problem:
• Given, say, 25M rows and 50 cols, takes
forever to even load the data
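For reference, the task itself takes only a few lines of plain Python; the difficulty described above is scale, not logic. A minimal sketch follows, where the name `fill_forward` is illustrative and not part of Pandas or GraphLab Create.

```python
def fill_forward(values, missing=None):
    """Return a new list in which each missing entry takes the most
    recent non-missing value. Leading missing entries stay missing."""
    filled = []
    last_value = missing
    for value in values:
        if value is not missing:
            last_value = value
        filled.append(last_value)
    return filled


print(fill_forward([1, 2, 3, None, 6]))  # [1, 2, 3, 3, 6]
```

This version holds the whole array in memory as a Python list, which is the limitation the slide points to for very large inputs.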
GLC SDK solution
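> cat fill.cpp
#include <flexible_type/flexible_type.hpp>
#include <unity/lib/toolkit_function_macros.hpp>
#include <unity/lib/gl_sarray.hpp>
using namespace graphlab;

// Replace each missing value with the most recent non-missing value.
// Assumes sa is non-empty, since it reads sa[0] up front.
gl_sarray fill(gl_sarray sa) {
  gl_sarray_writer writer(sa.dtype(), 1);  // same element type, 1 segment
  flexible_type last_value = sa[0];
  for (const auto& elem : sa.range_iterator()) {
    if (elem != FLEX_UNDEFINED)
      last_value = elem;          // remember the latest real value
    writer.write(last_value, 0);  // append to segment 0
  }
  return writer.close();
}

// Expose fill() to Python, with its argument named "sa".
BEGIN_FUNCTION_REGISTRATION
REGISTER_FUNCTION(fill, "sa");
END_FUNCTION_REGISTRATION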
GLC SDK solution
> cat Makefile
all: fill.so
fill.so: fill.cpp
	g++ -std=c++11 $^ -I graphlab -I ~/graphlab-dev/deps -shared -fPIC \
	    -o $@ -O3
> python
>>> import graphlab as gl
>>> gl.ext_import('fill.so', 'example')
>>> sa = gl.SArray([1, 2, 3, None, 6])
>>> print gl.extensions.example.fill.fill(sa)
[1, 2, 3, 3, 6]
Join the revolution!
• Research methods to make the following
efficient and automatic:
• Feature engineering
• Model selection
• Model debugging
• Problem formulation (??)
• Develop novel algorithms on top of our SDK
• Backed by scalable, flexible typed data structures
• Automatic Python wrappers
• Make them available to many other people
• We’re hiring! jobs@graphlab.com
