SlideShare a Scribd company logo
1 of 41
Download to read offline
Building your in-memory data lake,
Applying Data Quality, and
Getting your data in shape for Machine Learning
Jean Georges Perrin / @jgperrin / Zaloni
The Key to
Machine Learning Is
Prepping the Right Data
JGP
Software, Services, Solutions
USA (Raleigh, NC), India, Dubai…
http://zaloni.com
@zaloni
@jgperrin
Jean Georges Perrin
Chapel Hill, NC
I 🏗SW
Big Data, Analytics, Web, Software,
Architecture, Small Data, DevTools,
Java, NLP, Knowledge Sharer,
Informix, Spark, CxO…
And What About You?
• Who is familiar with Data Quality?
• Who has already implemented Data Quality in
Spark?
• Who is expecting to be a ML guru after this
session?
• Who is expecting to provide better insights faster
after this session?
All Over Again
As goes the old adage:
Garbage In,
Garbage Out
xkcd
If Everything Was As Simple…
0
20
40
60
80
100
120
140
160
180
200
-1 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
Dinner
revenue per
number of
guests
…as a Visual Representation
0
20
40
60
80
100
120
140
160
180
200
-1 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
Anomaly #1
Anomaly #2
0
20
40
60
80
100
120
140
160
180
200
-1 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
I Love It When a Plan Comes Together
Data is like a box of
chocolates, you
never know what
you're gonna get
Jean ”Gump” Perrin
June 2017
Data From Everywhere
Databases
RDBMS, NoSQL
Files
CSV, XML, Json, Excel, Photos, Video
Machines
Services, REST, Streaming, IoT…
What Is Data Quality?
Skills (CACTAR):
• Consistency,
• Accuracy,
• Completeness,
• Timeliness,
• Accessibility,
• Reliability.
To allow:
• Operations,
• Decision making,
• Planning,
• Machine Learning.
Ouch, Bad Data Story• Hubble was blind
because of a
2.2nm error
• Legend says it’s a
metric to
imperial/US
conversion
Now it Hurts
• Challenger exploded in
1986 because of a defective
O-ring
• Root causes were:
– Invalid data for
the O-ring parts
– Lack of data-lineage &
reporting
#1 Scripts and the Likes
• Source data are
cleaned by scripts
(shell, Java app…)
• I/O intensive
• Storage space
intensive
• No parallelization
#2 Use Spark SQL
• All in memory!
• But limited to SQL and
built-in function
#3 Use UDFs
• User Defined Function
• SQL can be extended with
UDFs
• UDFs benefit from the
cluster architecture and
distributed processing
• A tool like Zaloni’s Mica
(shameless plug) helps
evaluate the result
SPRU Your UDF
1. Service: build your code, it might be already
existing!
2. Plumbing: connect your business logic to
Spark via an UDF
3. Register: the UDF in Spark
4. Use: the UDF is available in Spark SQL and via
callUDF()
Code Sample #1.2 - Plumbing
package net.jgp.labs.sparkdq4ml.dq.udf;
import org.apache.spark.sql.api.java.UDF1;
import net.jgp.labs.sparkdq4ml.dq.service.*;
public class MinimumPriceDataQualityUdf
implements UDF1< Double, Double > {
public Double call(Double price) throws Exception {
return MinimumPriceDataQualityService.checkMinimumPrice(price);
}
}
/jgperrin/net.jgp.labs.sparkdq4ml
Code Sample #1.3 - Register
SparkSession spark = SparkSession
.builder().appName("DQ4ML").master("local").getOrCreate();
spark.udf().register(
"minimumPriceRule",
new MinimumPriceDataQualityUdf(),
DataTypes.DoubleType);
/jgperrin/net.jgp.labs.sparkdq4ml
Code Sample #1.4 - Use
String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
.option("inferSchema", "true").option("header", "false")
.load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
"price_no_min",
callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE
price_no_min > 0");
Using CSV,
but could be
Hive, JDBC,
name it…
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+
|guest|price|
+-----+-----+
| 1|23.24|
| 2|30.89|
| 2|33.74|
| 3|34.89|
| 3|29.91|
| 3| 38.0|
| 4| 40.0|
| 5|120.0|
| 6| 50.0|
| 6|112.0|
| 8| 60.0|
| 8|127.0|
| 8|120.0|
| 9|130.0|
+-----+-----+
Code Sample #1.4 - Use
String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
.option("inferSchema", "true").option("header", "false")
.load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
"price_no_min",
callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE
price_no_min > 0");
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
| 1| 23.1| 23.1|
| 2| 30.0| 30.0|
| 2| 33.0| 33.0|
| 3| 34.0| 34.0|
| 24|142.0| 142.0|
| 24|138.0| 138.0|
| 25| 3.0| -1.0|
| 26| 10.0| -1.0|
| 25| 15.0| -1.0|
| 26| 4.0| -1.0|
| 28| 10.0| -1.0|
| 28|158.0| 158.0|
| 30|170.0| 170.0|
| 31|180.0| 180.0|
+-----+-----+------------+
Code Sample #1.4 - Use
String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
.option("inferSchema", "true").option("header", "false")
.load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
"price_no_min",
callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE
price_no_min > 0");
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+
|guest|price|
+-----+-----+
| 1| 23.1|
| 2| 30.0|
| 2| 33.0|
| 3| 34.0|
| 3| 30.0|
| 4| 40.0|
| 19|110.0|
| 20|120.0|
| 22|131.0|
| 24|142.0|
| 24|138.0|
| 28|158.0|
| 30|170.0|
| 31|180.0|
+-----+-----+
Store First, Ask Questions Later
• Build an in-memory data lake 🤔
• Store the data in Spark, trust its data type
detection capabilities
• Build Business Rules with the Business People
• Your data is now trusted and can be used for
Machine Learning (and other activities)
Data Can Now Be Used for ML
• Convert/Adapt dataset to Features and Label
• Required for Linear Regression in MLlib
– Needs a column called label of type double
– Needs a column called features of type VectorUDT
Code Sample #2 - Register & Use
spark.udf().register(
"vectorBuilder",
new VectorBuilder(),
new VectorUDT());
df = df.withColumn("label", df.col("price"));
df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));
// ... Lots of complex ML code goes here ...
double p = model.predict(features);
System.out.println("Prediction for " + feature + " guests is " + p);
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+-----+--------+------------------+
|guest|price|label|features| prediction|
+-----+-----+-----+--------+------------------+
| 1| 23.1| 23.1| [1.0]|24.563807596513133|
| 2| 30.0| 30.0| [2.0]|29.595283312577884|
| 2| 33.0| 33.0| [2.0]|29.595283312577884|
| 3| 34.0| 34.0| [3.0]| 34.62675902864264|
| 3| 30.0| 30.0| [3.0]| 34.62675902864264|
| 3| 38.0| 38.0| [3.0]| 34.62675902864264|
| 4| 40.0| 40.0| [4.0]| 39.65823474470739|
| 14| 89.0| 89.0| [14.0]| 89.97299190535493|
| 16|102.0|102.0| [16.0]|100.03594333748444|
| 20|120.0|120.0| [20.0]|120.16184620174346|
| 22|131.0|131.0| [22.0]|130.22479763387295|
| 24|142.0|142.0| [24.0]|140.28774906600245|
+-----+-----+-----+--------+------------------+
Prediction for 40.0 guests is 220.79136052303852
Key Takeaways
• Build your data lake in memory.
• Store first, ask questions later.
• Sample your data (to spot anomalies and more).
• Rely on Spark SQL and UDFs to build a consistent
& coherent dataset.
• Rely on UDFs for prepping ML formats.
• Use Java.
Thank You.
Jean Georges “JGP” Perrin
jgperrin@zaloni.com
@jgperrin
Don’t forget
to rate
Backup Slides & Other Resources
More Reading & Resources
• My blog: http://jgp.net
• Zaloni’s blog (BigData, Hadoop, Spark):
http://blog.zaloni.com
• This code on GitHub:
https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
• Java Code on GitHub:
https://github.com/jgperrin/net.jgp.labs.spark
/jgperrin/net.jgp.labs.sparkdq4ml
Data Quality
• Consistency: across multiple data sources, servers,
platform…
• Accuracy: the value must mean something, I was born,
in France, on 05/10/1971 but I am a Libra (October).
• Completeness: missing fields, values, rows…
• Timeliness: how recent and relevant is the data, e.g.
45m Americans change address every year.
• Accessibility: how easily can data be accessed an
manipulated (Mica…).
• Reliability: reliability of sources, of input, production…
34
Industry-leading enterprise
data lake management,
governance and
self-service platform
Expert data lake
professional services
(Design, Implementation,
Workshops, Training)
Solutions to simplify
implementation
and reduce business risk
Simplifying big data to enable the data-powered enterprise
Zaloni Confidential and Proprietary
35
• Bedrock Data Lake
Management and
Governance Platform
• Design & Implementation
• Discovery Workshops
• Training
• Data Lake in a Box
• Ingestion Factory
• Custom Solutions
Simplifying big data to enable the data-powered enterprise
Zaloni Confidential and Proprietary
• Mica Self-service Data
and Catalog
Software Services Solutions
New Buyer’s Guide on Data Lake Management and
Governance
• Zaloni and Industry analyst
firm, Enterprise Strategy
Group, collaborated on a
guide to help you:
1. Define evaluation criteria and compare
common options
2. Set up a successful proof of concept (PoC)
3. Develop an implementation that is future-
proofed
Free eBooks to help you future-proof your data lake
initiative
Download now at: resources.zaloni.com
DATA LAKE MANAGEMENT
AND GOVERNANCE PLATFORM
SELF-SERVICE DATA PLATFORM
Photo & Other Credits
• JGP @ Zaloni: Pexels
• Machine Learning Comic, xkcd, Attribution-NonCommercial 2.5 License
• Box of Chocolates: Pexels
• Data From Everywhere: multiple authors, Pexels
• The Hubble Space Telescope in orbit, Public Domain, Nasa
• Challender Explosion, Public Domain, Nasa
• Criticality of data quality as exemplied in two disasters, Craig W. Fishera, Bruce R.
Kingmab,
https://pdfs.semanticscholar.org/b9b7/447e0e2d0f255a5a9910fc26f2663dc738f4.pdf.
• Store First, Ask Questions Later: Jean Georges Perrin, Attribution-ShareAlike 4.0
International
• Data Quality for Dummies, https://www.linkedin.com/pulse/data-quality-dummies-lee-cullom
Code on GitHub
Fork it and like it @
https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
Abstract
The key to machine learning is prepping the right data
Everyone wants to apply Machine Learning and its data. Sure. It sounds easy. However, when
it comes to implementation, you need to make sure that the data matches the format expecting
by the libraries.
Machine learning has its challenges and understanding the algorithms is not always easy, but
in this session, we’ll discuss methods to make these challenges less daunting. After a quick
introduction to data quality and a level-set on common vocabulary, we will study the formats
that are required by Spark ML to run its algorithms. We will then see how we can automate the
build through UDF (User Defined Functions) and other techniques. Automation will bring easy
reproducibility, minimize errors, and increase the efficiency of data scientists.
Software engineers who need to understand the requirements and constraints of data
scientists.
Data scientists who need to implement/help implement production systems.
Fundamentals about data quality and how to make sure the data is usable – 10%.
Formats required by Spark ML, new vocabulary – 25%.
Build the required tool set in Java – 65%.

More Related Content

What's hot

Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsBen Laird
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Spark Summit
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillDatabricks
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Spark Summit
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Spark Summit
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
Stateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedStateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedJamie Grier
 
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Lightbend
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache FlinkAKASH SIHAG
 
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Databricks
 
Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 PivotalOpenSourceHub
 

What's hot (20)

Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos Erotocritou
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan Kessler
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Stateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedStateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory Speed
 
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
 
Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16
 

Similar to Building Your Data Lake for ML

Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecJosh Patterson
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JJosh Patterson
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...DataStax Academy
 
Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecJosh Patterson
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016MLconf
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about SparkGiivee The
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰Wayne Chen
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nltieleman
 
Blue Teaming on a Budget of Zero
Blue Teaming on a Budget of ZeroBlue Teaming on a Budget of Zero
Blue Teaming on a Budget of ZeroKyle Bubp
 
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class The Dirty Little Secrets They Didn’t Teach You In Pentesting Class
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class Chris Gates
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nlbartzon
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify Dataconomy Media
 
The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18DataconomyGmbH
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark PresentationStephen Borg
 
How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018
How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018
How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018Mike Harris
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agilityelliando dias
 
Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbMongoDB APAC
 
Growing in the wild. The story by cubrid database developers (Esen Sagynov, E...
Growing in the wild. The story by cubrid database developers (Esen Sagynov, E...Growing in the wild. The story by cubrid database developers (Esen Sagynov, E...
Growing in the wild. The story by cubrid database developers (Esen Sagynov, E...Ontico
 

Similar to Building Your Data Lake for ML (20)

Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4J
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
 
Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVec
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Blue Teaming on a Budget of Zero
Blue Teaming on a Budget of ZeroBlue Teaming on a Budget of Zero
Blue Teaming on a Budget of Zero
 
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class The Dirty Little Secrets They Didn’t Teach You In Pentesting Class
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
 
The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
 
How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018
How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018
How I Learned to Stop Worrying and Love Legacy Code - Ox:Agile 2018
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agility
 
Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
Buildingsocialanalyticstoolwithmongodb
 
Growing in the wild. The story by cubrid database developers (Esen Sagynov, E...
Growing in the wild. The story by cubrid database developers (Esen Sagynov, E...Growing in the wild. The story by cubrid database developers (Esen Sagynov, E...
Growing in the wild. The story by cubrid database developers (Esen Sagynov, E...
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 

Recently uploaded (20)

SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 

Building Your Data Lake for ML

  • 1. Building your in-memory data lake, Applying Data Quality, and Getting your data in shape for Machine Learning Jean Georges Perrin / @jgperrin / Zaloni The Key to Machine Learning Is Prepping the Right Data
  • 2. JGP Software, Services, Solutions USA (Raleigh, NC), India, Dubai… http://zaloni.com @zaloni @jgperrin Jean Georges Perrin Chapel Hill, NC I 🏗SW Big Data, Analytics, Web, Software, Architecture, Small Data, DevTools, Java, NLP, Knowledge Sharer, Informix, Spark, CxO…
  • 3. And What About You? • Who is familiar with Data Quality? • Who has already implemented Data Quality in Spark? • Who is expecting to be a ML guru after this session? • Who is expecting to provide better insights faster after this session?
  • 4. All Over Again As goes the old adage: Garbage In, Garbage Out xkcd
  • 5. If Everything Was As Simple… 0 20 40 60 80 100 120 140 160 180 200 -1 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Dinner revenue per number of guests
  • 6. …as a Visual Representation 0 20 40 60 80 100 120 140 160 180 200 -1 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Anomaly #1 Anomaly #2
  • 7. 0 20 40 60 80 100 120 140 160 180 200 -1 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 I Love It When a Plan Comes Together
  • 8. Data is like a box of chocolates, you never know what you're gonna get Jean ”Gump” Perrin June 2017
  • 9. Data From Everywhere Databases RDBMS, NoSQL Files CSV, XML, Json, Excel, Photos, Video Machines Services, REST, Streaming, IoT…
  • 10. What Is Data Quality? Skills (CACTAR): • Consistency, • Accuracy, • Completeness, • Timeliness, • Accessibility, • Reliability. To allow: • Operations, • Decision making, • Planning, • Machine Learning.
  • 11. Ouch, Bad Data Story• Hubble was blind because of a 2.2nm error • Legend says it’s a metric to imperial/US conversion
  • 12. Now it Hurts • Challenger exploded in 1986 because of a defective O-ring • Root causes were: – Invalid data for the O-ring parts – Lack of data-lineage & reporting
  • 13. #1 Scripts and the Likes • Source data are cleaned by scripts (shell, Java app…) • I/O intensive • Storage space intensive • No parallelization
  • 14. #2 Use Spark SQL • All in memory! • But limited to SQL and built-in function
  • 15. #3 Use UDFs • User Defined Function • SQL can be extended with UDFs • UDFs benefit from the cluster architecture and distributed processing • A tool like Zaloni’s Mica (shameless plug) helps evaluate the result
  • 16. SPRU Your UDF 1. Service: build your code, it might be already existing! 2. Plumbing: connect your business logic to Spark via an UDF 3. Register: the UDF in Spark 4. Use: the UDF is available in Spark SQL and via callUDF()
  • 17. Code Sample #1.2 - Plumbing package net.jgp.labs.sparkdq4ml.dq.udf; import org.apache.spark.sql.api.java.UDF1; import net.jgp.labs.sparkdq4ml.dq.service.*; public class MinimumPriceDataQualityUdf implements UDF1< Double, Double > { public Double call(Double price) throws Exception { return MinimumPriceDataQualityService.checkMinimumPrice(price); } } /jgperrin/net.jgp.labs.sparkdq4ml
  • 18. Code Sample #1.3 - Register SparkSession spark = SparkSession .builder().appName("DQ4ML").master("local").getOrCreate(); spark.udf().register( "minimumPriceRule", new MinimumPriceDataQualityUdf(), DataTypes.DoubleType); /jgperrin/net.jgp.labs.sparkdq4ml
  • 19. Code Sample #1.4 - Use String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); Using CSV, but could be Hive, JDBC, name it… /jgperrin/net.jgp.labs.sparkdq4ml
  • 20. +-----+-----+ |guest|price| +-----+-----+ | 1|23.24| | 2|30.89| | 2|33.74| | 3|34.89| | 3|29.91| | 3| 38.0| | 4| 40.0| | 5|120.0| | 6| 50.0| | 6|112.0| | 8| 60.0| | 8|127.0| | 8|120.0| | 9|130.0| +-----+-----+
  • 21. Code Sample #1.4 - Use String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); /jgperrin/net.jgp.labs.sparkdq4ml
  • 22. +-----+-----+------------+ |guest|price|price_no_min| +-----+-----+------------+ | 1| 23.1| 23.1| | 2| 30.0| 30.0| | 2| 33.0| 33.0| | 3| 34.0| 34.0| | 24|142.0| 142.0| | 24|138.0| 138.0| | 25| 3.0| -1.0| | 26| 10.0| -1.0| | 25| 15.0| -1.0| | 26| 4.0| -1.0| | 28| 10.0| -1.0| | 28|158.0| 158.0| | 30|170.0| 170.0| | 31|180.0| 180.0| +-----+-----+------------+
  • 23. Code Sample #1.4 - Use String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); /jgperrin/net.jgp.labs.sparkdq4ml
  • 24. +-----+-----+ |guest|price| +-----+-----+ | 1| 23.1| | 2| 30.0| | 2| 33.0| | 3| 34.0| | 3| 30.0| | 4| 40.0| | 19|110.0| | 20|120.0| | 22|131.0| | 24|142.0| | 24|138.0| | 28|158.0| | 30|170.0| | 31|180.0| +-----+-----+
  • 25. Store First, Ask Questions Later • Build an in-memory data lake 🤔 • Store the data in Spark, trust its data type detection capabilities • Build Business Rules with the Business People • Your data is now trusted and can be used for Machine Learning (and other activities)
  • 26. Data Can Now Be Used for ML • Convert/Adapt dataset to Features and Label • Required for Linear Regression in MLlib – Needs a column called label of type double – Needs a column called features of type VectorUDT
  • 27. Code Sample #2 - Register & Use spark.udf().register( "vectorBuilder", new VectorBuilder(), new VectorUDT()); df = df.withColumn("label", df.col("price")); df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest"))); // ... Lots of complex ML code goes here ... double p = model.predict(features); System.out.println("Prediction for " + feature + " guests is " + p); /jgperrin/net.jgp.labs.sparkdq4ml
  • 28. +-----+-----+-----+--------+------------------+ |guest|price|label|features| prediction| +-----+-----+-----+--------+------------------+ | 1| 23.1| 23.1| [1.0]|24.563807596513133| | 2| 30.0| 30.0| [2.0]|29.595283312577884| | 2| 33.0| 33.0| [2.0]|29.595283312577884| | 3| 34.0| 34.0| [3.0]| 34.62675902864264| | 3| 30.0| 30.0| [3.0]| 34.62675902864264| | 3| 38.0| 38.0| [3.0]| 34.62675902864264| | 4| 40.0| 40.0| [4.0]| 39.65823474470739| | 14| 89.0| 89.0| [14.0]| 89.97299190535493| | 16|102.0|102.0| [16.0]|100.03594333748444| | 20|120.0|120.0| [20.0]|120.16184620174346| | 22|131.0|131.0| [22.0]|130.22479763387295| | 24|142.0|142.0| [24.0]|140.28774906600245| +-----+-----+-----+--------+------------------+ Prediction for 40.0 guests is 220.79136052303852
  • 29. Key Takeaways • Build your data lake in memory. • Store first, ask questions later. • Sample your data (to spot anomalies and more). • Rely on Spark SQL and UDFs to build a consistent & coherent dataset. • Rely on UDFs for prepping ML formats. • Use Java.
  • 30. Thank You. Jean Georges “JGP” Perrin jgperrin@zaloni.com @jgperrin Don’t forget to rate
  • 31. Backup Slides & Other Resources
  • 32. More Reading & Resources • My blog: http://jgp.net • Zaloni’s blog (BigData, Hadoop, Spark): http://blog.zaloni.com • This code on GitHub: https://github.com/jgperrin/net.jgp.labs.sparkdq4ml • Java Code on GitHub: https://github.com/jgperrin/net.jgp.labs.spark /jgperrin/net.jgp.labs.sparkdq4ml
  • 33. Data Quality • Consistency: across multiple data sources, servers, platform… • Accuracy: the value must mean something, I was born, in France, on 05/10/1971 but I am a Libra (October). • Completeness: missing fields, values, rows… • Timeliness: how recent and relevant is the data, e.g. 45m Americans change address every year. • Accessibility: how easily can data be accessed an manipulated (Mica…). • Reliability: reliability of sources, of input, production…
  • 34. 34 Industry-leading enterprise data lake management, governance and self-service platform Expert data lake professional services (Design, Implementation, Workshops, Training) Solutions to simplify implementation and reduce business risk Simplifying big data to enable the data-powered enterprise Zaloni Confidential and Proprietary
  • 35. 35 • Bedrock Data Lake Management and Governance Platform • Design & Implementation • Discovery Workshops • Training • Data Lake in a Box • Ingestion Factory • Custom Solutions Simplifying big data to enable the data-powered enterprise Zaloni Confidential and Proprietary • Mica Self-service Data and Catalog Software Services Solutions
  • 36. New Buyer’s Guide on Data Lake Management and Governance • Zaloni and Industry analyst firm, Enterprise Strategy Group, collaborated on a guide to help you: 1. Define evaluation criteria and compare common options 2. Set up a successful proof of concept (PoC) 3. Develop an implementation that is future- proofed
  • 37. Free eBooks to help you future-proof your data lake initiative Download now at: resources.zaloni.com
  • 38. DATA LAKE MANAGEMENT AND GOVERNANCE PLATFORM SELF-SERVICE DATA PLATFORM
  • 39. Photo & Other Credits • JGP @ Zaloni: Pexels • Machine Learning Comic, xkcd, Attribution-NonCommercial 2.5 License • Box of Chocolates: Pexels • Data From Everywhere: multiple authors, Pexels • The Hubble Space Telescope in orbit, Public Domain, Nasa • Challender Explosion, Public Domain, Nasa • Criticality of data quality as exemplied in two disasters, Craig W. Fishera, Bruce R. Kingmab, https://pdfs.semanticscholar.org/b9b7/447e0e2d0f255a5a9910fc26f2663dc738f4.pdf. • Store First, Ask Questions Later: Jean Georges Perrin, Attribution-ShareAlike 4.0 International • Data Quality for Dummies, https://www.linkedin.com/pulse/data-quality-dummies-lee-cullom
  • 40. Code on GitHub Fork it and like it @ https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
  • 41. Abstract The key to machine learning is prepping the right data Everyone wants to apply Machine Learning and its data. Sure. It sounds easy. However, when it comes to implementation, you need to make sure that the data matches the format expecting by the libraries. Machine learning has its challenges and understanding the algorithms is not always easy, but in this session, we’ll discuss methods to make these challenges less daunting. After a quick introduction to data quality and a level-set on common vocabulary, we will study the formats that are required by Spark ML to run its algorithms. We will then see how we can automate the build through UDF (User Defined Functions) and other techniques. Automation will bring easy reproducibility, minimize errors, and increase the efficiency of data scientists. Software engineers who need to understand the requirements and constraints of data scientists. Data scientists who need to implement/help implement production systems. Fundamentals about data quality and how to make sure the data is usable – 10%. Formats required by Spark ML, new vocabulary – 25%. Build the required tool set in Java – 65%.