© 2016 MapR Technologies, MapR Confidential
Real-World Machine Learning - Leverage the
Features of MapR Converged Data Platform
Mathieu Dumoulin (mdumoulin@mapr.com)
Mateusz Dymczyk (mateusz@h2o.ai)
Hadoop Summit Tokyo 2016
Today’s goals
• Machine Learning projects in the Enterprise
have a LOT of requirements beyond training a
good ML model
• Current options are too complex
• Need a Converged Data Platform
• Introduce specific features useful for ML:
– MapR-FS, Volumes, Mirrors and Topologies
– MapR-DB and MapR Streams
Mathieu Dumoulin, Data Engineer
• Master’s degree in text classification
on Hadoop at Fujitsu Canada’s
Innovation Lab
• In Tokyo, I’ve worked as a Data
Scientist, Search Engineer and Data
Engineer
• I like Scikit-Learn and H2O
• I love Japanese food, especially nabe (hot pot) and shabu-shabu.

Mateusz Dymczyk, Software Engineer
• M.Sc. in CS (Software and
System Engineering) @ AGH
UST, Poland
• Ph.D. (Machine Learning) dropout
• Software Engineer @ H2O.ai
• Previously ML/NLP @ Fujitsu
Laboratories and en-japan inc
• I’m taking Sommelier classes
A common machine learning pipeline
*Image from scikit-learn.org
… meets the real world (Enterprise IT)
• Data comes from many sources, maybe very large
• Needs ETL and cleaning
• Finding the best algorithm and parameters can use a lot of CPU
• Data isn’t always labeled!
• From production systems? Is it real time?
• What server will serve predictions?
• The predictions are used by another system...
Machine learning here...
Is not the same when you do it here
Enterprise machine learning matters
Growing number of ML use cases at successful companies
• Anomaly Detection
• Customer 360
• Fraud Detection
• Log Security Analysis
• Recommender Engines
• Sensor Data Analysis (IoT)
• Personalized Offers
• Ad Tech
…but it’s HARD
Ref: http://advancedspark.com/ , https://github.com/fluxcapacitor/pipeline
There must be a better way...
Big data Enterprise IT infrastructure for ML
An ideal platform for ML:
• You can start simple and show value quickly
• It just works: easy configuration and administration
• Works with existing systems and tools
• Includes common basics (file storage, DB, streams)
• Strong ecosystem support (Apache projects)
• Enterprise class (multi-tenancy, security, HA, support)
MapR Converged Data Platform
[Architecture diagram: layers of the MapR Converged Data Platform]
• Top layer: Open Source Engines & Tools, Commercial Engines & Applications, Search & Others, Cloud & Managed Services, Custom Apps
• Data Processing
• Utility-Grade Platform Services: MapR-FS (Enterprise Storage), MapR-DB (Database), MapR Streams (Event Streaming)
• Platform properties: Global Namespace, High Availability, Data Protection, Self-healing, Unified Security, Real-time, Multi-tenancy
• Unified Management and Monitoring
MapR is great for Enterprise ML projects
• MapR-FS and NFS mount
• Volumes and Topologies
• Mirrors and Snapshots
MapR Filesystem
• Native implementation in C/C++, so it’s fast
• Use it like your own local filesystem
• Everything that can use files works as usual
• Unique MapR technology
• For more info, watch on YouTube:
– What is MapR-FS
– MapR-FS vs. HDFS
Working, battle-tested distributed read-write filesystem
NFS Mount
Mount the cluster as a regular folder
$> sudo mount -o hard,nolock ip-10-0-0-110:/mapr /mapr
$> ll /mapr/hadoopsummit/
total 3
drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:21 apps
drwxr-xr-x. 2 mapr mapr 0 Oct 13 11:12 hbase
drwxr-xr-x. 3 root root 1 Oct 13 11:21 installer
drwxr-xr-x. 2 mapr mapr 0 Oct 13 11:14 opt
drwxrwxrwx. 2 mapr mapr 1 Oct 14 10:41 tmp
drwxr-xr-x. 6 mapr mapr 4 Oct 14 10:52 user
drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:13 var
© 2014 MapR Technologies 21
MapR NFS and Volumes
[mapr@ip-10-0-0-110 mapr]$ pwd
/mapr/hadoopsummit/user/mapr
MapR-FS and NFS mount for ML
• Get started quickly and simply
• Use your favorite tool like...
– Custom code (Scikit-learn, R)
– SPSS, SAS, RapidMiner
– Apache Spark, Drill, Flink
• Super easy data import
– Just save to a file on MapR
– Integrate with legacy servers and code
– Use any ecosystem tool (e.g. Sqoop); it all works
• Quick and scalable round trip during development
– ETL/cleaning -> train/test -> predict
– Don’t copy data (cluster to cluster, local to cluster)
• Run in production directly from the cluster
– No copying around
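Because the cluster is mounted as an ordinary directory, the development round trip can be sketched with nothing but plain file I/O. The following is a minimal sketch and not code from the talk: the file names `flights.csv` and `model.json`, the made-up data, and the toy threshold "model" are all hypothetical, and a temporary directory stands in for a mounted path such as `/mapr/hadoopsummit/user/mapr` so the example runs anywhere. In practice the training step would be Scikit-learn, R, or any other tool that reads files.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_sample_data(data_file: Path) -> None:
    """Write a tiny, made-up dataset: departure delay (minutes) and a late flag."""
    rows = [(5, 0), (12, 0), (48, 1), (75, 1), (3, 0), (60, 1)]
    with data_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["dep_delay", "late"])
        writer.writerows(rows)


def train(data_file: Path) -> dict:
    """'Train' a toy model: the midpoint between the mean delay of each class."""
    delays = {0: [], 1: []}
    with data_file.open(newline="") as f:
        for row in csv.DictReader(f):
            delays[int(row["late"])].append(float(row["dep_delay"]))
    mean_on_time = sum(delays[0]) / len(delays[0])
    mean_late = sum(delays[1]) / len(delays[1])
    return {"threshold": (mean_on_time + mean_late) / 2}


def predict(model: dict, dep_delay: float) -> int:
    """Return 1 (late) if the delay exceeds the learned threshold, else 0."""
    return int(dep_delay > model["threshold"])


def roundtrip(base_dir: Path) -> dict:
    """Import data, train, save the model, and load it back, all under base_dir."""
    data_file = base_dir / "flights.csv"
    model_file = base_dir / "model.json"
    write_sample_data(data_file)
    model = train(data_file)
    model_file.write_text(json.dumps(model))
    return json.loads(model_file.read_text())


if __name__ == "__main__":
    # A temporary directory stands in for the NFS-mounted cluster path.
    with tempfile.TemporaryDirectory() as tmp:
        model = roundtrip(Path(tmp))
        print(model, predict(model, 50.0))
```

Pointing `base_dir` at a directory under the mount is the only change needed to run the same code against the cluster, which is the point of the "don't copy data" bullet.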
Volumes and Topologies - Managed in MCS
Volumes and Topologies
Volumes are just “regular” volumes
Select which nodes hold a volume’s data = Topology
Volumes and Topologies for ML
• With YARN’s Node Labels, run tasks on nodes with
guaranteed data locality
– Special nodes with GPU, high memory or big CPU
• Multi-Tenancy
– Share cluster with business use cases in production
– Data isolation guaranteed
– Easy unified admin (Data scientists != Hadoop
admin)
– Bigger cluster, more reliable and faster
Snapshots and Mirrors
Snapshots - Instant point in time save
Mirrors - Physical copy
Snapshots
[... mateusz]$ cd .snapshot
[... .snapshot]$ ll
total 1
drwxr-xr-x. 2 mapr mapr 1 Oct 14 10:56 mateusz.snap1
Snapshots and Mirrors for ML
• Versioned data and models = Repeatable results
– same model, same data guaranteed
– Go back in time for free
• Keep intermediate transformations
– Quickly change your mind, don’t redo work
• A/B Testing easy-mode
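Since a snapshot shows up as a directory under `.snapshot`, "same model, same data" can be checked with ordinary file operations. The sketch below is illustrative only: the helper names, the file `train.csv`, and the hand-made `.snapshot` directory are assumptions made so the example runs without a cluster. On a real volume the platform creates the snapshot directory.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot_path(volume_dir: Path, snapshot_name: str, relative: str) -> Path:
    """Path of a file as it was when the named snapshot was taken."""
    return volume_dir / ".snapshot" / snapshot_name / relative


def sha256_of(path: Path) -> str:
    """Content hash, read in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_since(volume_dir: Path, snapshot_name: str, relative: str) -> bool:
    """True if the live file differs from the snapshotted copy."""
    live = volume_dir / relative
    frozen = snapshot_path(volume_dir, snapshot_name, relative)
    return sha256_of(live) != sha256_of(frozen)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        volume = Path(tmp)
        # Simulate a snapshot by hand; this only mimics the directory layout.
        frozen_dir = volume / ".snapshot" / "mateusz.snap1"
        frozen_dir.mkdir(parents=True)
        (frozen_dir / "train.csv").write_text("a,b\n1,2\n")
        # The live file has gained a row since the snapshot was taken.
        (volume / "train.csv").write_text("a,b\n1,2\n3,4\n")
        print(changed_since(volume, "mateusz.snap1", "train.csv"))
```

Training from `snapshot_path(...)` rather than from the live file is one way to pin a model to an exact version of its data.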
Real-time events and DB for ML
• Built-in, no config, it just works
• Support next-gen use cases
– hyper-personalization of web/store content
– IoT Sensor data
• Easy to start small, and it grows with your data and use case
MapR Converged Application Blueprint
• Microservices connected by real-time streams
– Ideal to serve predictions from ML models
• Next-Generation large-scale architecture
• Working example: https://www.mapr.com/appblueprint/overview
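The shape of a prediction microservice fed by a stream can be sketched in a few lines. This is a rough illustration under stated assumptions, not the blueprint's code: Python's in-process `queue.Queue` stands in for the input and output stream topics, a thread stands in for the service, and the `score` function with its fixed limit is a made-up placeholder for a trained model. A real deployment would replace the queues with stream producers and consumers.

```python
import queue
import threading


def score(event: dict) -> dict:
    """Placeholder model: flag readings above a fixed, made-up limit."""
    return {"id": event["id"], "alert": event["value"] > 100.0}


def prediction_service(events: queue.Queue, predictions: queue.Queue) -> None:
    """Consume events until a None sentinel arrives; publish one prediction per event."""
    while True:
        event = events.get()
        if event is None:
            predictions.put(None)  # pass the shutdown signal downstream
            break
        predictions.put(score(event))


def run_pipeline(readings: list) -> list:
    """Publish readings to the input queue and collect the service's predictions."""
    events, predictions = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=prediction_service, args=(events, predictions))
    worker.start()
    for i, value in enumerate(readings):
        events.put({"id": i, "value": value})
    events.put(None)  # no more events
    results = []
    while (item := predictions.get()) is not None:
        results.append(item)
    worker.join()
    return results


if __name__ == "__main__":
    for prediction in run_pipeline([42.0, 130.5, 99.9]):
        print(prediction)
```

The producer and the service share only the two queues, so either side can be replaced or scaled without the other knowing, which is the property the stream-connected design relies on.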
Converged Data Platform 💖
Machine Learning
• Features that work together to support all phases of ML
• Supports your existing tools/code and the state of the art
large scale frameworks
• Easier to manage, more robust and secure.
• MapR is made for the enterprise and great for ML!
Demo of H2O on MapR: Features in Action
Agenda
• Why tooling matters in Machine Learning
• What is H2O and Sparkling Water
• Why MapR
• Demo
ML project problems
• Multiple data sources
• Different formats
• Large volumes of data to be read
• System bootstrap time
• Collaboration between data scientists
• Comparing models
• Deployment of the model
• Versioning
• Too many moving parts!
• etc.
Successful ML platform
• Fast ingestion and manipulation of versatile data
• Intuitive modeling UI/API
• Easy model validation, visualisation and comparison
• Easy model deployment with versioning for fast predictions
• Written in high performance Java - native Java API
• Supports multiple file formats and data sources
• ETL capabilities
• Highly parallel and distributed implementation
• Fast in-memory computation on highly compressed data
• Allows you to use all your data without sampling
• Runs on top of most major Hadoop distributions
[Diagram labels: ML platform, Ingestion platform, Big data platform]
What is H2O?
• Open source platform
• Exposes math and predictive algorithms
– GLM, Random Forest, GBM, Deep Learning, etc.
FlowUI
• Notebook-style open source interface for H2O
• Code execution, mathematics, plots, and rich media
Why H2O?
• Fast ingestion and manipulation of versatile data
– Blazing fast data parsing, supports multiple formats and data sources
• Intuitive modeling UI/API
– FlowUI, R/Python/REST APIs
• Easy model validation, visualisation and comparison
– Cross-validation, FlowUI graphs, comparison via Steam
• Easy model deployment with versioning for fast predictions
– Model export as POJO, deploy as service via Steam
What is Sparkling Water?
• Framework integrating Spark and H2O
• H2O instances on Spark executors
• Lets you call Spark and H2O methods together
Why MapR?
• H2O + MapR-FS = fast data ingestion made even faster
• Data resilience
• MapR snapshots + H2O modelling from checkpoints = continuous and versioned modelling
Demo
Airline delay classification*
• ETL: load data from CSVs
• Modelling: model using H2O’s GLM
• Predictions: model predicting flight delays
* https://github.com/h2oai/sparkling-water/tree/master/examples/scripts
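For readers unfamiliar with GLMs, a binomial GLM for a delayed/not-delayed label is logistic regression. The sketch below fits one by plain batch gradient descent in standard-library Python. It is not H2O's API or solver and not the demo script: the data is made up, the single feature (departure delay in hours) is a hypothetical choice, and the learning rate and epoch count are arbitrary.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(late) = sigmoid(w * x + b) by batch gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the log loss: (prediction - label) per example.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def predict_proba(model, x: float) -> float:
    """Probability that a flight with departure delay x (hours) arrives late."""
    w, b = model
    return sigmoid(w * x + b)


# Made-up data: departure delay in hours and whether the flight arrived late.
DEP_DELAY_HOURS = [0.0, 0.1, 0.2, 0.3, 0.9, 1.0, 1.2, 1.5]
ARRIVED_LATE = [0, 0, 0, 0, 1, 1, 1, 1]

if __name__ == "__main__":
    model = fit_logistic(DEP_DELAY_HOURS, ARRIVED_LATE)
    for delay in (0.05, 1.3):
        print(f"{delay:.2f} h -> p(late) = {predict_proba(model, delay):.3f}")
```

A distributed library adds regularization, multiple features, and parallel solvers on top of this idea, but the model being fitted has the same form.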
Q & A
@mapr
mdumoulin@mapr.com
Engage with us!
mapr-technologies

More Related Content

What's hot

Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceHortonworks
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitMilind Bhandarkar
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07Ted Dunning
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedTed Dunning
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Codemotion
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsMilind Bhandarkar
 
Drill lightning-london-big-data-10-01-2012
Drill lightning-london-big-data-10-01-2012Drill lightning-london-big-data-10-01-2012
Drill lightning-london-big-data-10-01-2012Ted Dunning
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceUwe Printz
 

What's hot (20)

Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
Openstack
OpenstackOpenstack
Openstack
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
Scaling hadoopapplications
Scaling hadoopapplicationsScaling hadoopapplications
Scaling hadoopapplications
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
Drill lightning-london-big-data-10-01-2012
Drill lightning-london-big-data-10-01-2012Drill lightning-london-big-data-10-01-2012
Drill lightning-london-big-data-10-01-2012
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 

Viewers also liked

NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsJason Shao
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?Attunity
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadeaviadea
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRclive boulton
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APImcsrivas
 
Map r hadoop-security-mar2014 (2)
Map r hadoop-security-mar2014 (2)Map r hadoop-security-mar2014 (2)
Map r hadoop-security-mar2014 (2)MapR Technologies
 
Design Patterns for working with Fast Data in Kafka
Design Patterns for working with Fast Data in KafkaDesign Patterns for working with Fast Data in Kafka
Design Patterns for working with Fast Data in KafkaIan Downard
 
Why Elastic? @ 50th Vinitaly 2016
Why Elastic? @ 50th Vinitaly 2016Why Elastic? @ 50th Vinitaly 2016
Why Elastic? @ 50th Vinitaly 2016Christoph Wurm
 
Elastic v5.0.0 Update uptoalpha3 v0.2 - 김종민
Elastic v5.0.0 Update uptoalpha3 v0.2 - 김종민Elastic v5.0.0 Update uptoalpha3 v0.2 - 김종민
Elastic v5.0.0 Update uptoalpha3 v0.2 - 김종민NAVER D2
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksMapR Technologies
 
Understanding Metadata: Why it's essential to your big data solution and how ...
Understanding Metadata: Why it's essential to your big data solution and how ...Understanding Metadata: Why it's essential to your big data solution and how ...
Understanding Metadata: Why it's essential to your big data solution and how ...Zaloni
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR Technologies
 
Handling the Extremes: Scaling and Streaming in Finance
Handling the Extremes: Scaling and Streaming in FinanceHandling the Extremes: Scaling and Streaming in Finance
Handling the Extremes: Scaling and Streaming in FinanceMapR Technologies
 
IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014John Berns
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Philly DB MapR Overview
Philly DB MapR OverviewPhilly DB MapR Overview
Philly DB MapR Overview
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadea
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapR
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
Map r hadoop-security-mar2014 (2)
Map r hadoop-security-mar2014 (2)Map r hadoop-security-mar2014 (2)
Map r hadoop-security-mar2014 (2)
 
Hadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data WarehouseHadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data Warehouse
 
Design Patterns for working with Fast Data in Kafka
Design Patterns for working with Fast Data in KafkaDesign Patterns for working with Fast Data in Kafka
Design Patterns for working with Fast Data in Kafka
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 
Why Elastic? @ 50th Vinitaly 2016
Why Elastic? @ 50th Vinitaly 2016Why Elastic? @ 50th Vinitaly 2016
Why Elastic? @ 50th Vinitaly 2016
 
Elastic v5.0.0 Update uptoalpha3 v0.2 - 김종민
Elastic v5.0.0 Update uptoalpha3 v0.2 - 김종민Elastic v5.0.0 Update uptoalpha3 v0.2 - 김종민
Elastic v5.0.0 Update uptoalpha3 v0.2 - 김종민
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Understanding Metadata: Why it's essential to your big data solution and how ...
Understanding Metadata: Why it's essential to your big data solution and how ...Understanding Metadata: Why it's essential to your big data solution and how ...
Understanding Metadata: Why it's essential to your big data solution and how ...
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch Integration
 
Handling the Extremes: Scaling and Streaming in Finance
Handling the Extremes: Scaling and Streaming in FinanceHandling the Extremes: Scaling and Streaming in Finance
Handling the Extremes: Scaling and Streaming in Finance
 
IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 

Similar to Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform

Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...DataWorks Summit/Hadoop Summit
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016Mathieu Dumoulin
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
MapR and Machine Learning Primer
MapR and Machine Learning PrimerMapR and Machine Learning Primer
MapR and Machine Learning PrimerMathieu Dumoulin
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop DataWorks Summit/Hadoop Summit
 
Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016Adam Doyle
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on SparkMathieu Dumoulin
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR Technologies
 
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)BigDataEverywhere
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Mathieu Dumoulin
 
MapR 5.2: Getting More Value from the MapR Converged Data Platform
MapR 5.2: Getting More Value from the MapR Converged Data PlatformMapR 5.2: Getting More Value from the MapR Converged Data Platform
MapR 5.2: Getting More Value from the MapR Converged Data PlatformMapR Technologies
 
How Spark is Enabling the New Wave of Converged Cloud Applications
How Spark is Enabling the New Wave of Converged Cloud Applications How Spark is Enabling the New Wave of Converged Cloud Applications
How Spark is Enabling the New Wave of Converged Cloud Applications MapR Technologies
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillTomer Shiran
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksJustin Brandenburg
 
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...MapR Technologies
 
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Mathieu Dumoulin
 

Similar to Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform (20)

Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in Production
 
MapR and Machine Learning Primer
MapR and Machine Learning PrimerMapR and Machine Learning Primer
MapR and Machine Learning Primer
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
 
Streaming in the Extreme
Streaming in the ExtremeStreaming in the Extreme
Streaming in the Extreme
 
Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on Spark
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community Edition
 
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform

  • 1. Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform
      Mathieu Dumoulin (mdumoulin@mapr.com)
      Mateusz Dymczyk (mateusz@h2o.ai)
      Hadoop Summit Tokyo 2016
      © 2016 MapR Technologies, MapR Confidential
  • 2. Today’s goals
      • Machine Learning projects in the Enterprise have a LOT of requirements beyond training a good ML model
      • Current options are too complex
      • Need a Converged Data Platform
      • Introduce specific features useful for ML:
          – MapR-FS, Volumes, Mirrors and Topologies
          – MapR-DB and MapR Streams
  • 3. Mathieu Dumoulin, Data Engineer
      • Master’s degree in text classification on Hadoop at Fujitsu Canada’s Innovation Lab
      • In Tokyo, I’ve worked as a Data Scientist, Search Engineer and Data Engineer
      • I like Scikit-Learn and H2O
      • I love Japanese food, especially nabe (hot pot) and shabu-shabu.
  • 4. Mateusz Dymczyk, Software Engineer
      • M.Sc. in CS (Software and System Engineering) @ AGH UST, Poland
      • Ph.D. (Machine Learning) dropout
      • Software Engineer @ H2O.ai
      • Previously ML/NLP @ Fujitsu Laboratories and en-japan inc
      • I’m taking Sommelier classes
  • 5. A common machine learning pipeline (*image from scikit-learn.org)
  • 6. … meets the real world (Enterprise IT)
  • 7. … meets the real world
      • Data comes from many sources, maybe very large
      • Data isn’t always labeled!
  • 8. … meets the real world
      • Data comes from many sources, maybe very large
      • Needs ETL and cleaning
      • Finding the best algorithm and parameters can use a lot of CPU
      • Data isn’t always labeled!
  • 9. … meets the real world
      • Data comes from many sources, maybe very large
      • Needs ETL and cleaning
      • Finding the best algorithm and parameters can use a lot of CPU
      • Data isn’t always labeled!
      • From production systems? Is it real time?
      • What server will serve predictions?
      • The predictions are used by another system...
  • 10. Machine learning here...
  • 11. Is not the same when you do it here
  • 12. Enterprise machine learning matters
      Growing number of ML use cases at successful companies:
      • Anomaly Detection
      • Customer 360
      • Fraud Detection
      • Log Security Analysis
      • Recommender Engines
      • Sensor Data Analysis (IoT)
      • Personalized Offers
      • Ad Tech
  • 13. …but it’s HARD
      Ref: http://advancedspark.com/, https://github.com/fluxcapacitor/pipeline
  • 14. There must be a better way...
  • 15. Big data Enterprise IT infrastructure for ML
      An ideal platform for ML:
      • You can start simple and show value quickly
      • It just works. Easy configuration and administration.
      • Works with existing systems and tools
      • Includes common basics (File storage, DB, Streams)
      • Strong ecosystem support (Apache projects)
      • Enterprise class (multi-tenancy, security, HA, support)
  • 16. MapR Converged Data Platform
  • 17. MapR Converged Data Platform (architecture diagram)
      • Top layer: Open Source Engines & Tools, Commercial Engines & Applications, Cloud & Managed Services, Search & Others, Custom Apps
      • Data Processing
      • Enterprise Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
      • Utility-Grade Platform Services: Global Namespace, High Availability, Data Protection, Self-healing, Unified Security, Real-time, Multi-tenancy
      • Unified Management and Monitoring
  • 18. MapR is great for Enterprise ML projects
      • MapR-FS and NFS mount
      • Volumes and Topologies
      • Mirrors and Snapshots
  • 19. MapR Filesystem
      Working, battle-tested distributed read-write filesystem
      • Native implementation in C/C++, it’s fast
      • Use it like your own local filesystem
      • Everything that can use files works as usual
      • Unique MapR technology
      • For more info watch on YouTube:
          – What is MapR-FS
          – MapR-FS vs. HDFS
  • 20. NFS Mount: mount the cluster as a regular folder
      $> sudo mount -o hard,nolock ip-10-0-0-110:/mapr /mapr
      $> ll /mapr/hadoopsummit/
      total 3
      drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:21 apps
      drwxr-xr-x. 2 mapr mapr 0 Oct 13 11:12 hbase
      drwxr-xr-x. 3 root root 1 Oct 13 11:21 installer
      drwxr-xr-x. 2 mapr mapr 0 Oct 13 11:14 opt
      drwxrwxrwx. 2 mapr mapr 1 Oct 14 10:41 tmp
      drwxr-xr-x. 6 mapr mapr 4 Oct 14 10:52 user
      drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:13 var
  • 21. MapR NFS and Volumes
      [mapr@ip-10-0-0-110 mapr]$ pwd
      /mapr/hadoopsummit/user/mapr
  • 22. MapR NFS and Volumes
      [mapr@ip-10-0-0-110 mapr]$ pwd
      /mapr/hadoopsummit/user/mapr
  • 23. MapR NFS and Volumes
      [mapr@ip-10-0-0-110 mapr]$ pwd
      /mapr/hadoopsummit/user/mapr
  • 24. MapR-FS and NFS mount for ML
      • Get started quickly and simply
      • Use your favorite tool, like...
          – Custom code (Scikit-learn, R)
          – SPSS, SAS, RapidMiner
          – Apache Spark, Drill, Flink
      • Super easy data import
          – Just save to a file on MapR
          – Integrate with legacy servers and code
          – Use any ecosystem tool (e.g. Sqoop); it all works
      • Quick and scalable roundtrip during development
          – ETL/cleaning -> train/test -> predict
          – Don’t copy data (cluster to cluster, local to cluster)
      • Run in production directly from the cluster, with no copying around
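
The "just save to a file" point on slide 24 can be sketched with ordinary Python file I/O. This is a minimal illustration, not code from the talk: it assumes the cluster is NFS-mounted as on slide 20, and it uses a temporary directory as a stand-in for `/mapr/hadoopsummit/user/mapr` so that it runs on any machine. The dataset name and columns are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Assumption: on a machine with the NFS mount from slide 20, this would be
# Path("/mapr/hadoopsummit/user/mapr"). A temporary directory stands in for
# it here so the sketch runs anywhere.
CLUSTER_HOME = Path(tempfile.mkdtemp(prefix="mapr_nfs_standin_"))


def save_rows(path, header, rows):
    """Write rows to a CSV file using ordinary file I/O."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def load_rows(path):
    """Read a CSV file back as a list of dicts (all values are strings)."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


# Hypothetical dataset name and columns, for illustration only.
dataset = CLUSTER_HOME / "datasets" / "flights.csv"
save_rows(dataset, ["carrier", "dep_delay_min"], [["AA", 12], ["UA", 0]])
rows = load_rows(dataset)
print(len(rows), rows[0]["carrier"])
```

Nothing in the sketch is MapR-specific, which is the point of the slide: any tool that reads and writes files should work unchanged once the base path points at the mount.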
  • 25. Volumes and Topologies - Managed in MCS
  • 26. Volumes and Topologies
      • Volumes are just “regular” volumes
  • 27. Volumes and Topologies
      • Volumes are just “regular” volumes
      • Select what nodes for volume data = Topology
  • 28. Volumes and Topologies for ML
      • With YARN’s Node Labels, run tasks on nodes with guaranteed data locality
          – Special nodes with GPU, high memory or big CPU
      • Multi-Tenancy
          – Share cluster with business use cases in production
          – Data isolation guaranteed
          – Easy unified admin (Data scientists != Hadoop admin)
          – Bigger cluster, more reliable and faster
  • 29. Snapshots and Mirrors
  • 30. Snapshots and Mirrors
  • 31. Snapshots and Mirrors
  • 32. Snapshots - Instant point in time save
  • 33. Mirrors - Physical copy
  • 34. Snapshots
      [... mateusz]$ cd .snapshot
      [... .snapshot]$ ll
      total 1
      drwxr-xr-x. 2 mapr mapr 1 Oct 14 10:56 mateusz.snap1
  • 35. Snapshots and Mirrors for ML
      • Versioned data and models = Repeatable results
          – same model, same data guaranteed
          – Go back in time for free
      • Keep intermediate transformations
          – Quickly change your mind, don’t redo work
      • A/B Testing easy-mode
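
The "same model, same data" idea on slide 35 can be sketched by pinning training reads to a snapshot path. This is an illustration under stated assumptions: it relies only on the `.snapshot/<name>` directory layout shown on slide 34, uses a temporary directory as a stand-in for a volume, and emulates the snapshot with a plain file copy, because a real snapshot is created by the platform and not by this code. The helper `snapshot_path` and the file `train.csv` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_path(volume_root, snapshot_name, relative):
    """Build the path of a file as it existed in a named snapshot.

    Assumes the <volume>/.snapshot/<snapshot name>/ layout from slide 34.
    """
    return Path(volume_root) / ".snapshot" / snapshot_name / relative


# Stand-in for a volume directory; on a cluster this would be a path under
# the NFS mount.
volume = Path(tempfile.mkdtemp(prefix="volume_standin_"))
(volume / "train.csv").write_text("x,y\n1,0\n2,1\n")

# Emulation only: copy the file to mimic what a snapshot would preserve.
snap_dir = volume / ".snapshot" / "mateusz.snap1"
snap_dir.mkdir(parents=True)
shutil.copy(volume / "train.csv", snap_dir / "train.csv")

# The live data keeps changing after the snapshot was taken.
with (volume / "train.csv").open("a") as f:
    f.write("3,1\n")

live_lines = (volume / "train.csv").read_text().splitlines()
pinned_lines = snapshot_path(volume, "mateusz.snap1", "train.csv").read_text().splitlines()
print(len(live_lines), len(pinned_lines))
```

Recording the snapshot name next to a trained model would be one way to make a later rerun read exactly the same input, even though the live file has since grown.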
  • 36. Real-time events and DB for ML
      • Built-in, no config, it just works
      • Support next-gen use cases
          – hyper-personalization of web/store content
          – IoT Sensor data
      • Easy to start small but grows with your data/use case
  • 37. MapR Converged Application Blueprint
      • Microservices connected by real-time streams
          – Ideal to serve predictions from ML models
      • Next-Generation large-scale architecture
      • Working example: https://www.mapr.com/appblueprint/overview
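
The pattern on slide 37, a microservice that reads events from one stream and publishes predictions to another, can be sketched in a few lines. This is not the blueprint's code: in-process `queue.Queue` objects stand in for stream topics so the example runs without a cluster, and the `predict` function is a hypothetical threshold rule, not a trained model. A real service would replace the queues with a streams client.

```python
import json
import queue

# Stand-ins for an input topic and an output topic (assumption: a real
# deployment would use a streams client instead of in-process queues).
events = queue.Queue()
predictions = queue.Queue()


def predict(features):
    """Hypothetical model: flags a reading above a fixed threshold."""
    return {"anomaly": features["temperature"] > 80.0}


def serve(in_topic, out_topic):
    """Drain the input topic, score each event, publish each result.

    Returns the number of events handled.
    """
    handled = 0
    while True:
        try:
            raw = in_topic.get_nowait()
        except queue.Empty:
            return handled
        event = json.loads(raw)
        result = {"id": event["id"], **predict(event)}
        out_topic.put(json.dumps(result))
        handled += 1


events.put(json.dumps({"id": 1, "temperature": 72.5}))
events.put(json.dumps({"id": 2, "temperature": 91.0}))
handled = serve(events, predictions)
results = [json.loads(predictions.get_nowait()) for _ in range(handled)]
print(results)
```

Because the service only sees messages in and messages out, the model behind `predict` can be swapped without touching the systems that produce events or consume predictions.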
  • 38. Converged Data Platform 💖 Machine Learning
      • Features that work together to support all phases of ML
      • Supports your existing tools/code and the state-of-the-art large-scale frameworks
      • Easier to manage, more robust and secure
      • MapR is made for the enterprise and great for ML!
  • 39. Demo of H2O on MapR: Features in Action
  • 40. Agenda
      • Why tooling matters in Machine Learning
      • What is H2O and Sparkling Water
      • Why MapR
      • Demo
  • 41. ML project problems
      • Multiple data sources
      • Different formats
      • Large volumes of data to be read
      • System bootstrap time
      • Collaboration between data scientists
      • Comparing models
      • Deployment of the model
      • Versioning
      • Too many moving parts!
      • etc.
  • 42. Successful ML platform
      • Fast ingestion and manipulation of versatile data
      • Intuitive modeling UI/API
      • Easy model validation, visualisation and comparison
      • Easy model deployment w/ versioning for fast predictions
  • 43. What is H2O?
      Slide labels: ML platform, Ingestion platform, Big data platform
      • Open source platform
      • Exposes math and predictive algorithms
      • GLM, Random Forest, GBM, Deep Learning etc.
      • Written in high performance Java - native Java API
      • Supports multiple file formats and data sources
      • ETL capabilities
      • Highly parallel and distributed implementation
      • Fast in-memory computation on highly compressed data
      • Allows you to use all your data without sampling
      • Runs on top of most major Hadoop distributions
  • 44. FlowUI
      • Notebook style open source interface for H2O
      • Code execution, mathematics, plots, and rich media
  • 45. Why H2O?
      • Fast ingestion and manipulation of versatile data
          – Blazing fast data parsing, supports multiple formats and data sources
      • Intuitive modeling UI/API
          – FlowUI, R/Python/REST APIs
      • Easy model validation, visualisation and comparison
          – Cross-validation, FlowUI graphs, comparison via Steam
      • Easy model deployment w/ versioning for fast predictions
          – Model export as POJO, deploy as service via Steam
  • 46. What is Sparkling Water?
      • Framework integrating Spark and H2O
      • H2O instances on Spark executors
      • Lets you call Spark and H2O methods together
  • 47. Why MapR?
      • H2O + MapR-FS = fast data ingestion made even faster
      • Data resilience
      • MapR snapshots + H2O modelling from checkpoints = continuous and versioned modelling
  • 48. Demo
  • 49. Airline delay classification: model predicting flight delays
      • Stages: ETL, Modelling, Predictions
      • Load data from CSVs
      • Model using H2O’s GLM
      * https://github.com/h2oai/sparkling-water/tree/master/examples/scripts
  • 50. Q & A
      Engage with us! @mapr, mdumoulin@mapr.com, mapr-technologies