Building your in-memory data lake,
Applying Data Quality, and
Getting your data in shape for Machine Learning
Jean Georges Perrin / @jgperrin / Zaloni
The Key to
Machine Learning Is
Prepping the Right Data
JGP
Software, Services, Solutions
USA (Raleigh, NC), India, Dubai…
http://zaloni.com
@zaloni
@jgperrin
Jean Georges Perrin
Chapel Hill, NC
I🏗SW
Big Data, Analytics, Web, Software,
Architecture, Small Data, DevTools,
Java, NLP, Knowledge Sharer,
Informix, Spark, CxO…
And What About You?
• Who is familiar with Data Quality?
• Who has already implemented Data Quality in
Spark?
• Who is expecting to be an ML guru after this
session?
• Who is expecting to provide better insights faster
after this session?
All Over Again
As goes the old adage:
Garbage In,
Garbage Out
xkcd
If Everything Was As Simple…
Dinner
revenue per
number of
guests
…as a Visual Representation
Anomaly #1
Anomaly #2
I Love It When a Plan Comes Together
Data is like a box of
chocolates, you never
know what you're gonna
get
Jean “Gump” Perrin
June 2017
Data From Everywhere
Databases
RDBMS, NoSQL
Files
CSV, XML, JSON, Excel, Photos, Video
Machines
Services, REST, Streaming, IoT…
What Is Data Quality?
Skills (CACTAR):
• Consistency,
• Accuracy,
• Completeness,
• Timeliness,
• Accessibility,
• Reliability.
To allow:
• Operations,
• Decision making,
• Planning,
• Machine Learning.
Ouch, Bad Data Story
• Hubble was blind because of a 2.2 µm error
in the shape of its mirror
• Legend says it’s a metric to imperial/US
conversion
Now it Hurts
• Challenger exploded in
1986 because of a defective
O-ring
• Root causes were:
– Invalid data for
the O-ring parts
– Lack of data-lineage &
reporting
#1 Scripts and the Likes
• Source data are
cleaned by scripts
(shell, Java app…)
• I/O intensive
• Storage space
intensive
• No parallelization
#2 Use Spark SQL
• All in memory!
• But limited to SQL and
built-in functions
#3 Use UDFs
• User Defined Function
• SQL can be extended with
UDFs
• UDFs benefit from the
cluster architecture and
distributed processing
• A tool like Zaloni’s Mica
(shameless plug) helps
evaluate the result
SPRU Your UDF
1. Service: build your code, it might already exist!
2. Plumbing: connect your business logic to
Spark via a UDF
3. Register: the UDF in Spark
4. Use: the UDF is available in Spark SQL and via
callUDF()
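The Service step has no code sample of its own in this deck, so here is a minimal sketch of what it could look like. The class and method names are the ones called by the plumbing UDF in Code Sample #1.2; the body is an assumption. In particular, the threshold of 20 and the -1 sentinel are hypothetical values chosen to be consistent with the result tables later in the deck, not the author's confirmed implementation.

```java
// Sketch of the "Service" step: plain Java business logic, no Spark dependency.
// MIN_PRICE and the -1 sentinel are assumptions made for illustration.
final class MinimumPriceDataQualityService {
  // Hypothetical threshold: prices below it are considered invalid.
  private static final double MIN_PRICE = 20.0;
  // Sentinel returned for invalid prices, so a later filter can drop them.
  static final double INVALID = -1.0;

  private MinimumPriceDataQualityService() {
    // Utility class, not meant to be instantiated.
  }

  // Returns the price unchanged when it passes the rule, the sentinel otherwise.
  static double checkMinimumPrice(double price) {
    return price < MIN_PRICE ? INVALID : price;
  }

  public static void main(String[] args) {
    // Values taken from the sample rows shown in the deck.
    System.out.println(checkMinimumPrice(23.1));
    System.out.println(checkMinimumPrice(15.0));
  }
}
```

Keeping the rule in a class that knows nothing about Spark means it can be unit-tested on its own and reused outside the cluster.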
Code Sample #1.2 - Plumbing
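package net.jgp.labs.sparkdq4ml.dq.udf;

import org.apache.spark.sql.api.java.UDF1;

import net.jgp.labs.sparkdq4ml.dq.service.*;

// Plumbing: exposes the business rule to Spark as a one-argument UDF.
public class MinimumPriceDataQualityUdf implements UDF1<Double, Double> {
  private static final long serialVersionUID = 1L;

  @Override
  public Double call(Double price) throws Exception {
    // Delegate to the service so the rule stays independent of Spark.
    return MinimumPriceDataQualityService.checkMinimumPrice(price);
  }
}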
/jgperrin/net.jgp.labs.sparkdq4ml
Code Sample #1.3 - Register
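import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

SparkSession spark = SparkSession.builder()
    .appName("DQ4ML").master("local").getOrCreate();
// Register the UDF under the name used from Spark SQL and callUDF().
spark.udf().register(
    "minimumPriceRule",
    new MinimumPriceDataQualityUdf(),
    DataTypes.DoubleType);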
Code Sample #1.4 - Use
import static org.apache.spark.sql.functions.callUDF;

// Using CSV here, but the source could be Hive, JDBC, you name it…
String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true")
    .option("header", "false")
    .load(filename);
// Give the default _c0/_c1 columns meaningful names.
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
// Apply the rule: prices that fail it come back as a negative value.
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
// Keep only the rows that passed the rule.
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");
+-----+-----+
|guest|price|
+-----+-----+
|    1|23.24|
|    2|30.89|
|    2|33.74|
|    3|34.89|
|    3|29.91|
|    3| 38.0|
|    4| 40.0|
|    5|120.0|
|    6| 50.0|
|    6|112.0|
|    8| 60.0|
|    8|127.0|
|    8|120.0|
|    9|130.0|
+-----+-----+
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
|    1| 23.1|        23.1|
|    2| 30.0|        30.0|
|    2| 33.0|        33.0|
|    3| 34.0|        34.0|
|   24|142.0|       142.0|
|   24|138.0|       138.0|
|   25|  3.0|        -1.0|
|   26| 10.0|        -1.0|
|   25| 15.0|        -1.0|
|   26|  4.0|        -1.0|
|   28| 10.0|        -1.0|
|   28|158.0|       158.0|
|   30|170.0|       170.0|
|   31|180.0|       180.0|
+-----+-----+------------+
+-----+-----+
|guest|price|
+-----+-----+
|    1| 23.1|
|    2| 30.0|
|    2| 33.0|
|    3| 34.0|
|    3| 30.0|
|    4| 40.0|
|   19|110.0|
|   20|120.0|
|   22|131.0|
|   24|142.0|
|   24|138.0|
|   28|158.0|
|   30|170.0|
|   31|180.0|
+-----+-----+
Store First, Ask Questions Later
• Build an in-memory data lake 🤔
• Store the data in Spark, trust its data type
detection capabilities
• Build Business Rules with the Business People
• Your data is now trusted and can be used for
Machine Learning (and other activities)
Data Can Now Be Used for ML
• Convert/Adapt dataset to Features and Label
• Required for Linear Regression in MLlib
– Needs a column called label of type double
– Needs a column called features of type VectorUDT
Code Sample #2 - Register & Use
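// Register a UDF that turns the guest count into an ML feature vector.
spark.udf().register(
    "vectorBuilder",
    new VectorBuilder(),
    new VectorUDT());
// MLlib's linear regression expects "label" and "features" columns.
df = df.withColumn("label", df.col("price"));
df = df.withColumn(
    "features", callUDF("vectorBuilder", df.col("guest")));

// ... Lots of complex ML code goes here ...

// Assumed setup: 40 guests, the value that appears in the printed output.
Double feature = 40.0;
Vector features = Vectors.dense(feature);
double p = model.predict(features);
System.out.println("Prediction for " + feature + " guests is " + p);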
+-----+-----+-----+--------+------------------+
|guest|price|label|features|        prediction|
+-----+-----+-----+--------+------------------+
|    1| 23.1| 23.1|   [1.0]|24.563807596513133|
|    2| 30.0| 30.0|   [2.0]|29.595283312577884|
|    2| 33.0| 33.0|   [2.0]|29.595283312577884|
|    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
|    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
|    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
|    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
|   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
|   16|102.0|102.0|  [16.0]|100.03594333748444|
|   20|120.0|120.0|  [20.0]|120.16184620174346|
|   22|131.0|131.0|  [22.0]|130.22479763387295|
|   24|142.0|142.0|  [24.0]|140.28774906600245|
+-----+-----+-----+--------+------------------+
Prediction for 40.0 guests is 220.79136052303852
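Because the model is a linear regression on a single feature, every value in the prediction column sits on one straight line. As a rough sketch (plain Java, no Spark; the class and method names are illustrative and not part of the original code), the line can be recovered from the first two predictions in the table and then used to reproduce the prediction for 40 guests:

```java
// Sketch: recover the straight line behind the "prediction" column.
// The two input points are copied from the result table; names are illustrative.
final class LinearPredictionSketch {
  // Predictions shown in the table for 1 and 2 guests.
  static final double PREDICTION_1_GUEST = 24.563807596513133;
  static final double PREDICTION_2_GUESTS = 29.595283312577884;

  // Slope: extra predicted revenue per additional guest.
  static double slope() {
    return PREDICTION_2_GUESTS - PREDICTION_1_GUEST;
  }

  // Intercept: value of the line at 0 guests.
  static double intercept() {
    return PREDICTION_1_GUEST - slope();
  }

  // Evaluates the recovered line for a given number of guests.
  static double predict(double guests) {
    return intercept() + slope() * guests;
  }

  public static void main(String[] args) {
    System.out.println("slope = " + slope());
    System.out.println("intercept = " + intercept());
    System.out.println("Prediction for 40.0 guests is " + predict(40.0));
  }
}
```

In other words, the fitted model is roughly price ≈ 19.53 + 5.03 × guests, which gives about 220.79 for 40 guests, matching the printed prediction up to floating-point rounding.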
Key Takeaways
• Build your data lake in memory.
• Store first, ask questions later.
• Sample your data (to spot anomalies and more).
• Rely on Spark SQL and UDFs to build a consistent
& coherent dataset.
• Rely on UDFs for prepping ML formats.
• Use Java.
Thank You.
Jean Georges “JGP” Perrin
jgperrin@zaloni.com
@jgperrin
Don’t forget
to rate
Backup Slides & Other Resources
More Reading & Resources
• My blog: http://jgp.net
• Zaloni’s blog (BigData, Hadoop, Spark):
http://blog.zaloni.com
• This code on GitHub:
https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
• Java Code on GitHub:
https://github.com/jgperrin/net.jgp.labs.spark
Data Quality
• Consistency: across multiple data sources, servers,
platforms…
• Accuracy: the value must mean something. I was born
in France on 05/10/1971, but I am a Libra (October).
• Completeness: missing fields, values, rows…
• Timeliness: how recent and relevant is the data, e.g.
45 million Americans change address every year.
• Accessibility: how easily can data be accessed and
manipulated (Mica…).
• Reliability: reliability of sources, of input, production…
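The accuracy example above comes down to an ambiguous date format. A small sketch (plain Java with java.time; the class name and helper are illustrative) shows how the same string becomes two different dates depending on the convention assumed by the reader:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch: the same string yields two different dates depending on the
// assumed convention, which is the accuracy problem described above.
final class AmbiguousDateSketch {
  // Parses the text with the given pattern, e.g. "dd/MM/yyyy".
  static LocalDate parse(String text, String pattern) {
    return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
  }

  public static void main(String[] args) {
    String raw = "05/10/1971";
    // Day-first reading (common in France): 5 October 1971.
    System.out.println(parse(raw, "dd/MM/yyyy"));
    // Month-first reading (common in the US): 10 May 1971.
    System.out.println(parse(raw, "MM/dd/yyyy"));
  }
}
```

Only the day-first reading lands in October, which is the one compatible with being a Libra; a data quality rule would need to know the source's convention to pick it.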
Industry-leading enterprise
data lake management,
governance and
self-service platform
Expert data lake
professional services
(Design, Implementation,
Workshops, Training)
Solutions to simplify
implementation
and reduce business risk
Simplifying big data to enable the data-powered enterprise
Zaloni Confidential and Proprietary
Software
• Bedrock Data Lake Management and Governance Platform
• Mica Self-service Data and Catalog
Services
• Design & Implementation
• Discovery Workshops
• Training
Solutions
• Data Lake in a Box
• Ingestion Factory
• Custom Solutions
New Buyer’s Guide on Data Lake Management and
Governance
• Zaloni and industry analyst
firm Enterprise Strategy
Group collaborated on a
guide to help you:
1. Define evaluation criteria and compare
common options
2. Set up a successful proof of concept (PoC)
3. Develop an implementation that is future-proofed
Free eBooks to help you future-proof your data lake
initiative
Download now at: resources.zaloni.com
DATA LAKE MANAGEMENT
AND GOVERNANCE PLATFORM
SELF-SERVICE DATA PLATFORM
Photo & Other Credits
• JGP @ Zaloni: Pexels
• Machine Learning Comic, xkcd, Attribution-NonCommercial 2.5 License
• Box of Chocolates: Pexels
• Data From Everywhere: multiple authors, Pexels
• The Hubble Space Telescope in orbit, Public Domain, NASA
• Challenger Explosion, Public Domain, NASA
• Criticality of data quality as exemplified in two disasters, Craig W. Fisher, Bruce R.
Kingma,
https://pdfs.semanticscholar.org/b9b7/447e0e2d0f255a5a9910fc26f2663dc738f4.pdf.
• Store First, Ask Questions Later: Jean Georges Perrin, Attribution-ShareAlike 4.0
International
• Data Quality for Dummies, https://www.linkedin.com/pulse/data-quality-dummies-lee-cullom
Code on GitHub
Fork it and like it @
https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
Abstract
The key to machine learning is prepping the right data
Everyone wants to apply Machine Learning to their data. Sure. It sounds easy. However, when
it comes to implementation, you need to make sure that the data matches the format expected
by the libraries.
Machine learning has its challenges, and understanding the algorithms is not always easy, but
in this session, we’ll discuss methods to make these challenges less daunting. After a quick
introduction to data quality and a level-set on common vocabulary, we will study the formats
that are required by Spark ML to run its algorithms. We will then see how we can automate the
build through UDFs (user-defined functions) and other techniques. Automation will bring easy
reproducibility, minimize errors, and increase the efficiency of data scientists.
Software engineers who need to understand the requirements and constraints of data
scientists.
Data scientists who need to implement/help implement production systems.
Fundamentals about data quality and how to make sure the data is usable – 10%.
Formats required by Spark ML, new vocabulary – 25%.
Build the required tool set in Java – 65%.

The Key to Machine Learning Is Prepping the Right Data

  • 1. Building your in-memory data lake, Applying Data Quality, and Getting your data in shape for Machine Learning Jean Georges Perrin / @jgperrin / Zaloni The Key to Machine Learning Is Prepping the Right Data
  • 2. JGP Software, Services, Solutions USA (Raleigh, NC), India, Dubai… http://zaloni.com @zaloni @jgperrin Jean Georges Perrin Chapel Hill, NC I🏗SW Big Data, Analytics, Web, Software, Architecture, Small Data, DevTools, Java, NLP, Knowledge Sharer, Informix, Spark, CxO…
  • 3. And What About You? • Who is familiar with Data Quality? • Who has already implemented Data Quality in Spark? • Who is expecting to be a ML guru after this session? • Who is expecting to provide better insights faster after this session?
  • 4. All Over Again As goes the old adage: Garbage In, Garbage Out xkcd
  • 5. If Everything Was As Simple… Dinner revenue per number of guests
  • 6. …as a Visual Representation Anomaly #1 Anomaly #2
  • 7. I Love It When a Plan Comes Together
  • 8. Data is like a box of chocolates, you never know what you're gonna get Jean ”Gump” Perrin June 2017
  • 9. Data From Everywhere Databases RDBMS, NoSQL Files CSV, XML, Json, Excel, Photos, Video Machines Services, REST, Streaming, IoT…
  • 10. What Is Data Quality? Skills (CACTAR): • Consistency, • Accuracy, • Completeness, • Timeliness, • Accessibility, • Reliability. To allow: • Operations, • Decision making, • Planning, • Machine Learning.
  • 11. Ouch, Bad Data Story• Hubble was blind because of a 2.2nm error • Legend says it’s a metric to imperial/US conversion
  • 12. Now it Hurts • Challenger exploded in 1986 because of a defective O-ring • Root causes were: – Invalid data for the O-ring parts – Lack of data-lineage & reporting
  • 13. #1 Scripts and the Likes • Source data are cleaned by scripts (shell, Java app…) • I/O intensive • Storage space intensive • No parallelization
  • 14. #2 Use Spark SQL • All in memory! • But limited to SQL and built-in function
  • 15. #3 Use UDFs • User Defined Function • SQL can be extended with UDFs • UDFs benefit from the cluster architecture and distributed processing • A tool like Zaloni’s Mica (shameless plug) helps evaluate the result
  • 16. SPRU Your UDF 1. Service: build your code, it might be already existing! 2. Plumbing: connect your business logic to Spark via an UDF 3. Register: the UDF in Spark 4. Use: the UDF is available in Spark SQL and via callUDF()
  • 17. Code Sample #1.2 - Plumbing package net.jgp.labs.sparkdq4ml.dq.udf; import org.apache.spark.sql.api.java.UDF1; import net.jgp.labs.sparkdq4ml.dq.service.*; public class MinimumPriceDataQualityUdf implements UDF1< Double, Double > { public Double call(Double price) throws Exception { return MinimumPriceDataQualityService.checkMinimumPrice(price); } } /jgperrin/net.jgp.labs.sparkdq4ml
  • 18. Code Sample #1.3 - Register SparkSession spark = SparkSession .builder().appName("DQ4ML").master("local").getOrCreate(); spark.udf().register( "minimumPriceRule", new MinimumPriceDataQualityUdf(), DataTypes.DoubleType); /jgperrin/net.jgp.labs.sparkdq4ml
  • 19. Code Sample #1.4 - Use String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); Using CSV, but could be Hive, JDBC, name it… /jgperrin/net.jgp.labs.sparkdq4ml
  • 20.
      +-----+-----+
      |guest|price|
      +-----+-----+
      |    1|23.24|
      |    2|30.89|
      |    2|33.74|
      |    3|34.89|
      |    3|29.91|
      |    3| 38.0|
      |    4| 40.0|
      |    5|120.0|
      |    6| 50.0|
      |    6|112.0|
      |    8| 60.0|
      |    8|127.0|
      |    8|120.0|
      |    9|130.0|
      +-----+-----+
  • 22.
      +-----+-----+------------+
      |guest|price|price_no_min|
      +-----+-----+------------+
      |    1| 23.1|        23.1|
      |    2| 30.0|        30.0|
      |    2| 33.0|        33.0|
      |    3| 34.0|        34.0|
      |   24|142.0|       142.0|
      |   24|138.0|       138.0|
      |   25|  3.0|        -1.0|
      |   26| 10.0|        -1.0|
      |   25| 15.0|        -1.0|
      |   26|  4.0|        -1.0|
      |   28| 10.0|        -1.0|
      |   28|158.0|       158.0|
      |   30|170.0|       170.0|
      |   31|180.0|       180.0|
      +-----+-----+------------+
  • 24.
      +-----+-----+
      |guest|price|
      +-----+-----+
      |    1| 23.1|
      |    2| 30.0|
      |    2| 33.0|
      |    3| 34.0|
      |    3| 30.0|
      |    4| 40.0|
      |   19|110.0|
      |   20|120.0|
      |   22|131.0|
      |   24|142.0|
      |   24|138.0|
      |   28|158.0|
      |   30|170.0|
      |   31|180.0|
      +-----+-----+
  • 25. Store First, Ask Questions Later • Build an in-memory data lake 🤔 • Store the data in Spark, trust its data type detection capabilities • Build Business Rules with the Business People • Your data is now trusted and can be used for Machine Learning (and other activities)
  • 26. Data Can Now Be Used for ML • Convert/Adapt dataset to Features and Label • Required for Linear Regression in MLlib – Needs a column called label of type double – Needs a column called features of type VectorUDT
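To make the roles of the two columns concrete: the label is the value to predict (here the price) and the features are the inputs (here a single value, the number of guests). The following plain-Java sketch fits ordinary least squares with one feature. It illustrates the idea only and is not MLlib's implementation; the class and method names are made up for this example.

```java
// Illustrative only: ordinary least squares with a single feature.
final class SimpleLinearFit {
  final double slope;
  final double intercept;

  private SimpleLinearFit(double slope, double intercept) {
    this.slope = slope;
    this.intercept = intercept;
  }

  /** Fits label ~ intercept + slope * feature by least squares. */
  static SimpleLinearFit fit(double[] feature, double[] label) {
    if (feature.length != label.length || feature.length < 2) {
      throw new IllegalArgumentException("need two arrays of equal length >= 2");
    }
    int n = feature.length;
    double meanX = 0.0;
    double meanY = 0.0;
    for (int i = 0; i < n; i++) {
      meanX += feature[i];
      meanY += label[i];
    }
    meanX /= n;
    meanY /= n;
    double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
    double sxx = 0.0; // sum of (x - meanX)^2
    for (int i = 0; i < n; i++) {
      double dx = feature[i] - meanX;
      sxy += dx * (label[i] - meanY);
      sxx += dx * dx;
    }
    if (sxx == 0.0) {
      throw new IllegalArgumentException("feature values must not all be equal");
    }
    double slope = sxy / sxx;
    return new SimpleLinearFit(slope, meanY - slope * meanX);
  }

  double predict(double featureValue) {
    return intercept + slope * featureValue;
  }

  public static void main(String[] args) {
    // A few cleaned rows from the deck: guests (feature) and price (label).
    double[] guests = {1, 2, 2, 3, 3, 4};
    double[] price = {23.1, 30.0, 33.0, 34.0, 30.0, 40.0};
    SimpleLinearFit model = fit(guests, price);
    System.out.println("price ~ " + model.intercept + " + " + model.slope + " * guests");
  }
}
```

In Spark the same two inputs travel as DataFrame columns, which is why the label must be a double and the features must be wrapped in a vector type.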
  • 27. Code Sample #2 - Register & Use
      // Register a UDF that wraps a column value into an ML vector.
      spark.udf().register(
          "vectorBuilder", new VectorBuilder(), new VectorUDT());
      // Linear regression expects a "label" column and a "features" column.
      df = df.withColumn("label", df.col("price"));
      df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));
      // ... Lots of complex ML code goes here ...
      double p = model.predict(features);
      System.out.println("Prediction for " + feature + " guests is " + p);
  • 28.
      +-----+-----+-----+--------+------------------+
      |guest|price|label|features|        prediction|
      +-----+-----+-----+--------+------------------+
      |    1| 23.1| 23.1|   [1.0]|24.563807596513133|
      |    2| 30.0| 30.0|   [2.0]|29.595283312577884|
      |    2| 33.0| 33.0|   [2.0]|29.595283312577884|
      |    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
      |    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
      |    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
      |    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
      |   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
      |   16|102.0|102.0|  [16.0]|100.03594333748444|
      |   20|120.0|120.0|  [20.0]|120.16184620174346|
      |   22|131.0|131.0|  [22.0]|130.22479763387295|
      |   24|142.0|142.0|  [24.0]|140.28774906600245|
      +-----+-----+-----+--------+------------------+
      Prediction for 40.0 guests is 220.79136052303852
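Because the model is a linear regression on a single feature, every value in the prediction column lies on one straight line, so the slope and intercept can be read back from any two rows with different guest counts. The sketch below does that arithmetic with the predictions for 1 and 2 guests and re-derives the 40-guest prediction of about 220.79. The class name is made up for this example.

```java
// Illustrative only: recovers the fitted line from two predicted points.
final class PredictionLine {
  final double slope;
  final double intercept;

  /** Builds the line through (x1, y1) and (x2, y2); x1 and x2 must differ. */
  PredictionLine(double x1, double y1, double x2, double y2) {
    if (x1 == x2) {
      throw new IllegalArgumentException("the two points need different x values");
    }
    this.slope = (y2 - y1) / (x2 - x1);
    this.intercept = y1 - this.slope * x1;
  }

  double predict(double guests) {
    return intercept + slope * guests;
  }

  public static void main(String[] args) {
    // Predictions for 1 and 2 guests, copied from the output table.
    PredictionLine line =
        new PredictionLine(1, 24.563807596513133, 2, 29.595283312577884);
    System.out.println("slope     = " + line.slope);
    System.out.println("intercept = " + line.intercept);
    System.out.println("Prediction for 40 guests is " + line.predict(40));
  }
}
```

Read this way, the model says roughly 5.03 per additional guest on top of a base of about 19.53.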
  • 29. Key Takeaways • Build your data lake in memory. • Store first, ask questions later. • Sample your data (to spot anomalies and more). • Rely on Spark SQL and UDFs to build a consistent & coherent dataset. • Rely on UDFs for prepping ML formats. • Use Java.
  • 30. Thank You. Jean Georges “JGP” Perrin jgperrin@zaloni.com @jgperrin Don’t forget to rate
  • 31. Backup Slides & Other Resources
  • 32. More Reading & Resources • My blog: http://jgp.net • Zaloni’s blog (BigData, Hadoop, Spark): http://blog.zaloni.com • This code on GitHub: https://github.com/jgperrin/net.jgp.labs.sparkdq4ml • Java Code on GitHub: https://github.com/jgperrin/net.jgp.labs.spark
  • 33. Data Quality • Consistency: across multiple data sources, servers, platforms… • Accuracy: the value must mean something; I was born in France on 05/10/1971, but I am a Libra (October). • Completeness: missing fields, values, rows… • Timeliness: how recent and relevant the data is, e.g. 45 million Americans change address every year. • Accessibility: how easily data can be accessed and manipulated (Mica…). • Reliability: reliability of sources, of input, production…
  • 34. Simplifying big data to enable the data-powered enterprise • Industry-leading enterprise data lake management, governance and self-service platform • Expert data lake professional services (Design, Implementation, Workshops, Training) • Solutions to simplify implementation and reduce business risk • Zaloni Confidential and Proprietary
  • 35. Simplifying big data to enable the data-powered enterprise • Software: Bedrock Data Lake Management and Governance Platform; Mica Self-service Data and Catalog • Services: Design & Implementation; Discovery Workshops; Training • Solutions: Data Lake in a Box; Ingestion Factory; Custom Solutions
  • 36. New Buyer’s Guide on Data Lake Management and Governance • Zaloni and industry analyst firm Enterprise Strategy Group collaborated on a guide to help you: 1. Define evaluation criteria and compare common options 2. Set up a successful proof of concept (PoC) 3. Develop an implementation that is future-proofed
  • 37. Free eBooks to help you future-proof your data lake initiative Download now at: resources.zaloni.com
  • 38. DATA LAKE MANAGEMENT AND GOVERNANCE PLATFORM SELF-SERVICE DATA PLATFORM
  • 39. Photo & Other Credits • JGP @ Zaloni: Pexels • Machine Learning Comic, xkcd, Attribution-NonCommercial 2.5 License • Box of Chocolates: Pexels • Data From Everywhere: multiple authors, Pexels • The Hubble Space Telescope in orbit, Public Domain, NASA • Challenger Explosion, Public Domain, NASA • Criticality of data quality as exemplified in two disasters, Craig W. Fisher, Bruce R. Kingma, https://pdfs.semanticscholar.org/b9b7/447e0e2d0f255a5a9910fc26f2663dc738f4.pdf. • Store First, Ask Questions Later: Jean Georges Perrin, Attribution-ShareAlike 4.0 International • Data Quality for Dummies, https://www.linkedin.com/pulse/data-quality-dummies-lee-cullom
  • 40. Code on GitHub Fork it and like it @ https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
  • 41. Abstract The key to machine learning is prepping the right data. Everyone wants to apply Machine Learning to their data. Sure. It sounds easy. However, when it comes to implementation, you need to make sure that the data matches the format expected by the libraries. Machine learning has its challenges and understanding the algorithms is not always easy, but in this session, we’ll discuss methods to make these challenges less daunting. After a quick introduction to data quality and a level-set on common vocabulary, we will study the formats that are required by Spark ML to run its algorithms. We will then see how we can automate the build through UDFs (User Defined Functions) and other techniques. Automation will bring easy reproducibility, minimize errors, and increase the efficiency of data scientists. Software engineers who need to understand the requirements and constraints of data scientists. Data scientists who need to implement/help implement production systems. Fundamentals about data quality and how to make sure the data is usable – 10%. Formats required by Spark ML, new vocabulary – 25%. Build the required tool set in Java – 65%.

Editor's Notes

  1. Highly iterative algorithms.
  2. Basic linear regression, easiest model
  3. Not a Mongolian Warlord. Consistency: across multiple data sources, servers, platforms… You do not want two sensors placed next to one another to report different levels of humidity. Accuracy: the value must mean something; I was born in France on 05/10/1971, but I am a Libra (October). Completeness: missing fields, values, rows… Timeliness: how recent and relevant the data is, e.g. 45 million Americans change address every year. Accessibility: how easily data can be accessed and manipulated (Mica…). Reliability: reliability of sources, of input, production…
  4. Repeat the question