Big Data made easy with a Spark
All Things Open 2018
Raleigh, NC
October 22nd 2018
Jean Georges Perrin
Software whatever since 1983
x10
@jgperrin
http://jgp.net [blog]
Who art thou?
๏ Experience with Spark?
๏ Experience with Hadoop?
๏ Experience with Scala?
๏ Java?
๏ PHP guru?
๏ Front-end developer?
But most importantly…
๏ … who is not a developer?
Agenda
๏ What is Big Data?
๏ What is Spark?
๏ What can I do with Spark?
๏ What is a Spark app, anyway?
๏ Install a bunch of software
๏ A first example
๏ Understand what just happened
๏ Another example, slightly more complex, because you are now ready
๏ But now, sincerely, what just happened?
๏ Let’s do AI!
๏ Going further
Caution
Hands-on tutorial
Tons of content
Unknown crowd
Unknown setting
3 Vs, 4 Vs, 5 Vs
Biiiiiiiig Data
๏ volume
๏ variety
๏ velocity
๏ variability
๏ value
Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
Data is considered big when it needs more than one computer to be processed
Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
Analytics operating system
An analytics operating system?
[Diagram: on one machine, Apps run on an OS, which runs on Hardware. On a cluster, each node has its own Hardware and OS; a Distributed OS spans the nodes, an Analytics OS sits on top of it, and Apps run on the Analytics OS.]
Some use cases
๏ NCEatery.com
๏ Restaurant analytics
๏ 1.57×10^21 datapoints analyzed
๏ (@ Lumeris)
๏ General compute
๏ Distributed data transfer
๏ IBM
๏ DSX (Data Science Experience)
๏ Watson Data Studio
๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
๏ CERN
๏ Analysis of the science experiments in the LHC - Large Hadron Collider
What does a typical app look like?
๏ Connect to the cluster
๏ Load data
๏ Do something with the data
๏ Share the results
Convinced?
On y va!
http://bit.ly/spark-clego
Get all the S T U F F
๏ Go to http://jgp.net/ato2018
๏ Install the software
๏ Access the source code
Download some tools
๏ Java JDK 1.8
๏ http://bit.ly/javadk8
๏ Eclipse Oxygen or later
๏ http://bit.ly/eclipseo2
๏ Other nice to have
๏ Maven
๏ SourceTree or git (command line)
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
http://www.eclipse.org/downloads/eclipse-packages/
Aren’t you glad we are
using Java?
Lab #1 - ingestion
๏ Goal
In a Big Data project, ingestion is the first operation. You get the data “in.”
๏ Source code
https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01
Getting deeper
๏ Go to net.jgp.books.sparkWithJava.ch01
๏ Open CsvToDataframeApp.java
๏ Right click, Run As, Java Application
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|                link|
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
+---+--------+--------------------+-----------+--------------------+
only showing top 5 rows
package net.jgp.books.sparkWithJava.ch01;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToDataframeApp {

  public static void main(String[] args) {
    CsvToDataframeApp app = new CsvToDataframeApp();
    app.start();
  }

  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("CSV to Dataset")
        .master("local")
        .getOrCreate();

    // Reads a CSV file with header, called books.csv, stores it in a dataframe
    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv");

    // Shows at most 5 rows from the dataframe
    df.show(5);
  }
}
/jgperrin/net.jgp.books.sparkWithJava.ch01
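To make the ingestion step a little less magical, here is a minimal plain-Java sketch, without Spark, of what "read a CSV file with a header, then show the first rows" roughly amounts to. The class name CsvIngestionSketch, its methods, and the sample rows are made up for illustration. It assumes fields never contain commas or quotes, which a real CSV reader has to handle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a tiny in-memory stand-in for a header-aware CSV read.
// Assumption: fields contain no commas or quotes.
class CsvIngestionSketch {

  // Turns CSV lines into rows keyed by the column names found in the header.
  static List<Map<String, String>> read(List<String> lines) {
    List<Map<String, String>> rows = new ArrayList<>();
    if (lines.isEmpty()) {
      return rows;
    }
    String[] header = lines.get(0).split(",", -1);
    for (String line : lines.subList(1, lines.size())) {
      String[] values = line.split(",", -1);
      Map<String, String> row = new LinkedHashMap<>();
      for (int i = 0; i < header.length; i++) {
        // A short line leaves the trailing columns empty (null)
        row.put(header[i], i < values.length ? values[i] : null);
      }
      rows.add(row);
    }
    return rows;
  }

  // Same idea as showing the top n rows: keeps at most n rows.
  static List<Map<String, String>> top(List<Map<String, String>> rows, int n) {
    return rows.subList(0, Math.min(n, rows.size()));
  }

  public static void main(String[] args) {
    // Hypothetical sample data, not the content of books.csv
    List<String> lines = Arrays.asList(
        "id,authorId,title",
        "1,1,Book A",
        "2,1,Book B",
        "3,2,Book C");
    for (Map<String, String> row : top(read(lines), 2)) {
      System.out.println(row);
    }
  }
}
```

The difference with Spark is where the rows live: here they sit in one JVM, while a dataframe can be spread over the nodes of a cluster.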
So what happened?
Let’s try to understand a little more
Apache Spark
๏ Spark SQL
๏ Spark Streaming
๏ MLlib (machine learning)
๏ GraphX (graph)
[Diagram: your application calls a unified API that gives access to Spark SQL, Spark streaming, machine learning & deep learning & artificial intelligence, and GraphX. These run on a cluster of nodes (Node 1 to Node 8 shown, with more implied), each with its own OS and hardware.]
[Diagram: the same stack, with the Dataframe added to it.]
Lab #2 - a bit of analytics
But really just a bit
Lab #2 - a little bit of analytics
๏ Goal
From two datasets, one containing books, the other authors, list the authors with the most books, ordered by number of books
๏ Source code
https://github.com/jgperrin/net.jgp.labs.spark
If it was in a relational database
authors.csv
id: integer
name: string
link: string
wikipedia: string
books.csv
id: integer
authorId: integer
title: string
releaseDate: string
link: string
Basic analytics
๏ Go to net.jgp.labs.spark.l200_join.l030_count_books
๏ Open AuthorsAndBooksCountBooksApp.java
๏ Right click, Run As, Java Application
+---+-------------------+--------------------+-----+
| id|               name|                link|count|
+---+-------------------+--------------------+-----+
|  1|      J. K. Rowling|http://amzn.to/2l...|    4|
| 12|William Shakespeare|http://amzn.to/2j...|    3|
|  4|      Denis Diderot|http://amzn.to/2i...|    2|
|  6|        Craig Walls|http://amzn.to/2A...|    2|
|  2|Jean Georges Perrin|http://amzn.to/2w...|    2|
|  3|         Mark Twain|http://amzn.to/2v...|    2|
| 11|       Alan Mycroft|http://amzn.to/2A...|    1|
| 10|        Mario Fusco|http://amzn.to/2A...|    1|
…
+---+-------------------+--------------------+-----+
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- link: string (nullable = true)
 |-- count: long (nullable = false)
package net.jgp.labs.spark.l200_join.l030_count_books;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AuthorsAndBooksCountBooksApp {

  public static void main(String[] args) {
    AuthorsAndBooksCountBooksApp app = new AuthorsAndBooksCountBooksApp();
    app.start();
  }

  private void start() {
    SparkSession spark = SparkSession.builder()
        .appName("Authors and Books")
        .master("local").getOrCreate();

    // Loads the authors
    String filename = "data/authors.csv";
    Dataset<Row> authorsDf = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(filename);

    // Loads the books
    filename = "data/books.csv";
    Dataset<Row> booksDf = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(filename);

    // Joins authors to their books, then counts the rows per author
    Dataset<Row> libraryDf = authorsDf
        .join(
            booksDf,
            authorsDf.col("id").equalTo(booksDf.col("authorId")),
            "left")
        .withColumn("bookId", booksDf.col("id"))
        .drop(booksDf.col("id"))
        .groupBy(
            authorsDf.col("id"),
            authorsDf.col("name"),
            authorsDf.col("link"))
        .count();

    // Authors with the most books first
    libraryDf = libraryDf
        .orderBy(libraryDf.col("count").desc());

    libraryDf.show();
    libraryDf.printSchema();
  }
}
/jgperrin/net.jgp.labs.spark
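For readers who want to see the logic without Spark, here is a hedged plain-Java sketch of the same idea: a left join of authors to books, a count per author, then a descending sort. The class AuthorBookCountSketch, its methods, and the sample ids are hypothetical. Books are reduced to {bookId, authorId} pairs to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: join + group + count on in-memory lists.
class AuthorBookCountSketch {

  // Counts the left-join rows for each author id.
  // Each book is an int[] {bookId, authorId}.
  static Map<Integer, Long> countRowsPerAuthor(
      List<Integer> authorIds, List<int[]> books) {
    Map<Integer, Long> counts = new LinkedHashMap<>();
    for (Integer authorId : authorIds) {
      long matches = 0;
      for (int[] book : books) {
        if (book[1] == authorId) {
          matches++;
        }
      }
      // A left join keeps an author without books as a single row,
      // so counting rows gives 1 for that author, not 0.
      counts.put(authorId, Math.max(matches, 1L));
    }
    return counts;
  }

  // Orders the (authorId, count) pairs by descending count.
  static List<Map.Entry<Integer, Long>> orderByCountDesc(
      Map<Integer, Long> counts) {
    List<Map.Entry<Integer, Long>> entries =
        new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
    return entries;
  }

  public static void main(String[] args) {
    // Hypothetical sample data
    List<Integer> authorIds = Arrays.asList(1, 2, 3);
    List<int[]> books = Arrays.asList(
        new int[] {1, 1}, new int[] {2, 1}, new int[] {5, 2});
    for (Map.Entry<Integer, Long> e
        : orderByCountDesc(countRowsPerAuthor(authorIds, books))) {
      System.out.println("author " + e.getKey() + ": " + e.getValue());
    }
  }
}
```

The nested loop is fine for a handful of rows. The point of the dataframe API is that the same join and count can be expressed once and run on data that does not fit on one machine.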
The art of delegating
[Diagram: your app is the driver, on the master. A cluster manager connects it to two workers, each labeled “Slave (Worker)”. Each worker hosts an executor, and each executor runs two tasks.]
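The delegation idea can be sketched on a single machine with a thread pool: a "driver" method cuts the work into tasks, hands them to workers, and combines the partial results. This is only an analogy, assuming threads in place of executors on worker nodes. DelegationSketch and its sum method are hypothetical names, not part of Spark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: threads stand in for executors, chunks for tasks.
class DelegationSketch {

  // The "driver": splits the data, delegates each chunk, sums the results.
  static long sum(int[] data, int workers) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try {
      // Chunk size, rounded up so that every element is covered
      int chunk = (data.length + workers - 1) / workers;
      List<Future<Long>> partials = new ArrayList<>();
      for (int start = 0; start < data.length; start += chunk) {
        final int from = start;
        final int to = Math.min(start + chunk, data.length);
        // One task per chunk
        Callable<Long> task = () -> {
          long s = 0;
          for (int i = from; i < to; i++) {
            s += data[i];
          }
          return s;
        };
        partials.add(pool.submit(task));
      }
      long total = 0;
      for (Future<Long> partial : partials) {
        try {
          total += partial.get();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        } catch (ExecutionException e) {
          throw new IllegalStateException(e.getCause());
        }
      }
      return total;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    int[] data = new int[100];
    for (int i = 0; i < data.length; i++) {
      data[i] = i + 1;
    }
    System.out.println("Sum computed by 4 workers: " + sum(data, 4));
  }
}
```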
Lab #3 - an even smaller bit of AI
But really just a bit
What’s AI anyway?
General AI
Popular beliefs
๏ Robot with human-like behavior
๏ HAL from 2001
๏ Isaac Asimov
๏ Potential ethical problems
Narrow AI
Current state-of-the-art
๏ Lots of mathematics
๏ Heavy calculations
๏ Algorithms
๏ Self-driving cars
I am an expert in
general AI
ARTIFICIAL INTELLIGENCE
is Machine Learning
Machine learning
๏ Common algorithms
  ๏ Linear and logistic regressions
  ๏ Classification and regression trees
  ๏ K-nearest neighbors (KNN)
๏ Deep learning
  ๏ Subset of ML
  ๏ Artificial neural networks (ANNs)
  ๏ Super CPU intensive, use of GPU
There are two kinds of data scientists:
1) Those who can extrapolate from incomplete data.
DATA Engineer
Develop, build, test, and operationalize datastores and large-scale processing systems.
DataOps is the new DevOps.
๏ Match architecture with business needs.
๏ Develop processes for data modeling, mining, and pipelines.
๏ Improve data reliability and quality.
DATA Scientist
Clean, massage, and organize data.
Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
๏ Prepare data for predictive models.
๏ Explore data to find hidden gems and patterns.
๏ Tell stories to key stakeholders.
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
DATA
Engineer
DATA
Scientist
SQL
All over again
As goes the old adage:
Garbage In,
Garbage Out
xkcd
Lab #3 - correcting and extrapolating data
Lab #3 - projecting data
๏ Goal
As a restaurant manager, I want to predict how much revenue a party of 40 will bring
๏ Source code
https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
If everything was as simple…
Dinner revenue per number of guests
…as a visual representation
[Chart: dinner revenue against number of guests, with Anomaly #1 and Anomaly #2 marked.]
I love it when a plan comes together
Load & Format
+-----+-----+
|guest|price|
+-----+-----+
|    1| 23.1|
|    2| 30.0|
…
+-----+-----+
only showing top 20 rows
----
1st DQ rule
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
|    1| 23.1|        23.1|
|    2| 30.0|        30.0|
…
|   25|  3.0|        -1.0|
|   26| 10.0|        -1.0|
…
+-----+-----+------------+
…
+-----+-----+-----+--------+
|guest|price|label|features|
+-----+-----+-----+--------+
|    1| 23.1| 23.1|   [1.0]|
|    2| 30.0| 30.0|   [2.0]|
…
+-----+-----+-----+--------+
only showing top 20 rows
…
RMSE: 2.802192495300457
r2: 0.9965340953376102
Intersection: 20.979190460591575
Regression parameter: 1.0
Tol: 1.0E-6
Prediction for 40.0 guests is 218.00351106373822
Using existing data quality rules
package net.jgp.labs.sparkdq4ml.dq.udf;

import org.apache.spark.sql.api.java.UDF1;
import net.jgp.labs.sparkdq4ml.dq.service.*;

public class MinimumPriceDataQualityUdf
    implements UDF1<Double, Double> {

  public Double call(Double price) throws Exception {
    return MinimumPriceDataQualityService.checkMinimumPrice(price);
  }
}
/jgperrin/net.jgp.labs.sparkdq4ml
If the price is OK, returns the price;
if the price is not OK, returns -1
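The service behind the UDF is not shown on the slides. A minimal sketch of such a rule could look like the following, assuming a hypothetical threshold of 20.0 (the deck does not state the real one) and a hypothetical class name MinimumPriceRuleSketch.

```java
// Illustrative only: a stand-in for a minimum-price data quality rule.
class MinimumPriceRuleSketch {

  // Hypothetical threshold; the actual value used in the lab is not shown.
  static final double MIN_PRICE = 20.0;

  // Returns the price when it passes the rule, -1 when it does not.
  static double checkMinimumPrice(double price) {
    return price < MIN_PRICE ? -1.0 : price;
  }

  public static void main(String[] args) {
    double[] prices = {23.1, 30.0, 3.0, 10.0};
    for (double price : prices) {
      System.out.println(price + " -> " + checkMinimumPrice(price));
    }
  }
}
```

Returning a marker value such as -1 instead of dropping the row keeps the rule simple: the rule only flags, and a later step decides what to do with flagged rows.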
Telling Spark to use my DQ rules
SparkSession spark = SparkSession.builder()
    .appName("DQ4ML").master("local").getOrCreate();

spark.udf().register(
    "minimumPriceRule",
    new MinimumPriceDataQualityUdf(),
    DataTypes.DoubleType);

spark.udf().register(
    "priceCorrelationRule",
    new PriceCorrelationDataQualityUdf(),
    DataTypes.DoubleType);
/jgperrin/net.jgp.labs.sparkdq4ml
Loading my dataset
// callUDF is a static import from org.apache.spark.sql.functions
String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true").option("header", "false")
    .load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price "
        + "WHERE price_no_min > 0");
Using CSV, but could be Hive, JDBC, you name it…
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+
|guest|price|
+-----+-----+
|    1|23.24|
|    2|30.89|
|    2|33.74|
|    3|34.89|
|    3|29.91|
|    3| 38.0|
|    4| 40.0|
|    5|120.0|
|    6| 50.0|
|    6|112.0|
|    8| 60.0|
|    8|127.0|
|    8|120.0|
|    9|130.0|
+-----+-----+
Raw data, contains the anomalies
Apply the rules
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
|    1| 23.1|        23.1|
|    2| 30.0|        30.0|
|    2| 33.0|        33.0|
|    3| 34.0|        34.0|
|   24|142.0|       142.0|
|   24|138.0|       138.0|
|   25|  3.0|        -1.0|
|   26| 10.0|        -1.0|
|   25| 15.0|        -1.0|
|   26|  4.0|        -1.0|
|   28| 10.0|        -1.0|
|   28|158.0|       158.0|
|   30|170.0|       170.0|
|   31|180.0|       180.0|
+-----+-----+------------+
Anomalies are clearly identified by -1, so they can be easily filtered
Filtering out anomalies
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price "
        + "WHERE price_no_min > 0");
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+
|guest|price|
+-----+-----+
|    1| 23.1|
|    2| 30.0|
|    2| 33.0|
|    3| 34.0|
|    3| 30.0|
|    4| 40.0|
|   19|110.0|
|   20|120.0|
|   22|131.0|
|   24|142.0|
|   24|138.0|
|   28|158.0|
|   30|170.0|
|   31|180.0|
+-----+-----+
Useable data
Format the data for ML
๏ Convert/Adapt dataset to Features and Label
๏ Required for Linear Regression in MLlib
  ๏ Needs a column called label of type double
  ๏ Needs a column called features of type VectorUDT
Format the data for ML
spark.udf().register(
    "vectorBuilder",
    new VectorBuilder(),
    new VectorUDT());
df = df.withColumn("label", df.col("price"));
df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));

// ... Lots of complex ML code goes here ...

double p = model.predict(features);
System.out.println("Prediction for " + feature + " guests is " + p);
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+-----+--------+------------------+
|guest|price|label|features|        prediction|
+-----+-----+-----+--------+------------------+
|    1| 23.1| 23.1|   [1.0]|24.563807596513133|
|    2| 30.0| 30.0|   [2.0]|29.595283312577884|
|    2| 33.0| 33.0|   [2.0]|29.595283312577884|
|    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
|    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
|    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
|    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
|   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
|   16|102.0|102.0|  [16.0]|100.03594333748444|
|   20|120.0|120.0|  [20.0]|120.16184620174346|
|   22|131.0|131.0|  [22.0]|130.22479763387295|
|   24|142.0|142.0|  [24.0]|140.28774906600245|
+-----+-----+-----+--------+------------------+
Prediction for 40.0 guests is 220.79136052303852
Prediction for 40 guests
(the complex ML code)
// Define the algorithm and its (hyper)parameters
LinearRegression lr = new LinearRegression()
    .setMaxIter(40)
    .setRegParam(1)
    .setElasticNetParam(1);

// Create a model from our data
LinearRegressionModel model = lr.fit(df);

// Apply the model to new data: predict
Double feature = 40.0;
Vector features = Vectors.dense(40.0);
double p = model.predict(features);
/jgperrin/net.jgp.labs.sparkdq4ml
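To show what "fit, then predict" means for one feature, here is a plain-Java sketch using ordinary least squares. This is a simpler technique swapped in for illustration: the lab uses Spark MLlib's LinearRegression with regularization parameters, so its numbers will differ. LinearRegressionSketch, fit, and predict are hypothetical names, and the sample points are made up.

```java
// Illustrative only: ordinary least squares on a single feature.
class LinearRegressionSketch {

  // Learning phase: returns {intercept, slope} of the best-fit line.
  static double[] fit(double[] x, double[] y) {
    if (x.length != y.length || x.length < 2) {
      throw new IllegalArgumentException(
          "Need at least two points and as many x as y");
    }
    int n = x.length;
    double meanX = 0;
    double meanY = 0;
    for (int i = 0; i < n; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    double sxy = 0; // sum of (x - meanX) * (y - meanY)
    double sxx = 0; // sum of (x - meanX)^2
    for (int i = 0; i < n; i++) {
      sxy += (x[i] - meanX) * (y[i] - meanY);
      sxx += (x[i] - meanX) * (x[i] - meanX);
    }
    if (sxx == 0) {
      throw new IllegalArgumentException("All x values are identical");
    }
    double slope = sxy / sxx;
    double intercept = meanY - slope * meanX;
    return new double[] {intercept, slope};
  }

  // Predictive phase: applies the model to a new value.
  static double predict(double[] model, double x) {
    return model[0] + model[1] * x;
  }

  public static void main(String[] args) {
    // Hypothetical, perfectly linear data: price = 20 + 5 * guests
    double[] guests = {1, 2, 3, 4};
    double[] prices = {25, 30, 35, 40};
    double[] model = fit(guests, prices);
    System.out.println(
        "Prediction for 40 guests is " + predict(model, 40.0));
  }
}
```

The model is just two numbers here. Once learned, it can be reused on any number of new values without going back to the training data.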
It’s all about the base model
Step 1: Learning phase
Dataset #1 → Trainer → Model
Step 2..n: Predictive phase
Dataset #2 → Model (same model) → Predicted Data
Conclusion
A (Big) Data Scenario
Data → Ingestion → Raw Data → Data Quality → Pure Data → Transformation → Rich Data → Load/Publish → Data
Key takeaways
๏ Big Data is easier than one could think
๏ Java is the way to go (or Python)
๏ New vocabulary for using Spark
๏ You have a friend to help (ok, me)
๏ Spark is fun
๏ Spark is easily extensible
Going further
๏ Contact me @jgperrin
๏ Join the Spark User mailing list
๏ Get help from Stack Overflow
๏ fb.com/TriangleSpark
Going further
Spark in action (Second edition, MEAP)
by Jean Georges Perrin
published by Manning
http://jgp.net/sia
One free book: sparkjava-65CE
40% off: ctwato18
Thanks
@jgperrin
Credits
Photos by Pexels
IBM PC XT by Ruben de Rijcke - http://dendmedia.com/
vintage/ - Own work, CC BY 3.0, https://
commons.wikimedia.org/w/index.php?curid=3610862
Illustrations © Jean Georges Perrin
No more slides
You’re on your own!

Spark Summit 2017 - A feedback for TASMJean-Georges Perrin
 
HTML (or how the web got started)
HTML (or how the web got started)HTML (or how the web got started)
HTML (or how the web got started)Jean-Georges Perrin
 
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...Jean-Georges Perrin
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseJean-Georges Perrin
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applicationsJean-Georges Perrin
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and servicesJean-Georges Perrin
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & servicesJean-Georges Perrin
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)Jean-Georges Perrin
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryJean-Georges Perrin
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryJean-Georges Perrin
 
Présentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT DaysPrésentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT DaysJean-Georges Perrin
 
Tendances Web 2011 San Francicsco
Tendances Web 2011 San FrancicscoTendances Web 2011 San Francicsco
Tendances Web 2011 San FrancicscoJean-Georges Perrin
 

More from Jean-Georges Perrin (20)

It's painful how much data rules the world
It's painful how much data rules the worldIt's painful how much data rules the world
It's painful how much data rules the world
 
Apache Spark v3.0.0
Apache Spark v3.0.0Apache Spark v3.0.0
Apache Spark v3.0.0
 
Why i love Apache Spark?
Why i love Apache Spark?Why i love Apache Spark?
Why i love Apache Spark?
 
The road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentionsThe road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentions
 
Spark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the CommunitySpark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the Community
 
Spark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMSpark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASM
 
HTML (or how the web got started)
HTML (or how the web got started)HTML (or how the web got started)
HTML (or how the web got started)
 
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applications
 
Vendre des produits techniques
Vendre des produits techniquesVendre des produits techniques
Vendre des produits techniques
 
Vendre plus sur le web
Vendre plus sur le webVendre plus sur le web
Vendre plus sur le web
 
Vendre plus sur le Web
Vendre plus sur le WebVendre plus sur le Web
Vendre plus sur le Web
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and services
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & services
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - Greenivory
 
Présentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT DaysPrésentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT Days
 
Tendances Web 2011 San Francicsco
Tendances Web 2011 San FrancicscoTendances Web 2011 San Francicsco
Tendances Web 2011 San Francicsco
 

Recently uploaded

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 

Recently uploaded (20)

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Big Data made easy with a Spark

  • 1. Big Data made easy with a Spark. All Things Open 2018, Raleigh, NC, October 22nd 2018
  • 2. Jean Georges Perrin Software whatever since 1983 x10 @jgperrin http://jgp.net [blog]
  • 3.
  • 4. Who are thou? ๏ Experience with Spark? ๏ Experience with Hadoop? ๏ Experience with Scala? ๏ Java? ๏ PHP guru? ๏ Front-end developer?
  • 5. But most importantly… ๏ … who is not a developer?
  • 6. Agenda ๏ What is Big Data? ๏ What is Spark? ๏ What can I do with Spark? ๏ What is a Spark app, anyway? ๏ Install a bunch of software ๏ A first example ๏ Understand what just happened ๏ Another example, slightly more complex, because you are now ready ๏ But now, sincerely, what just happened? ๏ Let’s do AI! ๏ Going further
  • 7. Caution Hands-on tutorial Tons of content Unknown crowd Unknown setting
  • 8. Biiiiiiiig Data: the 3, 4, 5 Vs ๏ volume ๏ variety ๏ velocity ๏ variability ๏ value Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
  • 9. Data is considered big when it needs more than one computer to be processed. Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
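  • 11. An analytics operating system? [Diagram: on a single machine, apps sit on an OS, which sits on hardware. On a cluster, each node has its own hardware and OS; a distributed OS spans the nodes, an analytics OS sits on top of it, and apps sit on the analytics OS.]
  • 12. An analytics operating system? [Diagram: the same cluster stack: hardware and OS per node, then distributed OS, analytics OS, and apps.]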
  • 13. Some use cases
       ๏ NCEatery.com
           ๏ Restaurant analytics
           ๏ 1.57×10^21 datapoints analyzed
       ๏ (@ Lumeris)
           ๏ General compute
           ๏ Distributed data transfer
       ๏ IBM
           ๏ DSX (Data Science Experience)
           ๏ Watson Data Studio
           ๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
       ๏ CERN
           ๏ Analysis of the science experiments in the LHC - Large Hadron Collider
  • 14. What does a typical app look like? Connect to the cluster → Load data → Do something with the data → Share the results
  • 17. Get all the S T U F F ๏ Go to http://jgp.net/ato2018 ๏ Install the software ๏ Access the source code
  • 18. Download some tools ๏ Java JDK 1.8 ๏ http://bit.ly/javadk8 ๏ Eclipse Oxygen or later ๏ http://bit.ly/eclipseo2 ๏ Other nice to have ๏ Maven ๏ SourceTree or git (command line) http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html http://www.eclipse.org/downloads/eclipse-packages/
  • 19. Aren’t you glad we are using Java?
  • 20. Lab #1 - ingestion
  • 21. Lab #1 - ingestion ๏ Goal: In a Big Data project, ingestion is the first operation. You get the data “in.” ๏ Source code: https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01
  • 22. Getting deeper ๏ Go to net.jgp.books.sparkWithJava.ch01 ๏ Open CsvToDataframeApp.java ๏ Right click, Run As, Java Application
  • 23.
       +---+--------+--------------------+-----------+--------------------+
       | id|authorId|               title|releaseDate|                link|
       +---+--------+--------------------+-----------+--------------------+
       |  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
       |  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
       |  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
       |  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
       |  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
       +---+--------+--------------------+-----------+--------------------+
       only showing top 5 rows
  • 24.
       package net.jgp.books.sparkWithJava.ch01;

       import org.apache.spark.sql.Dataset;
       import org.apache.spark.sql.Row;
       import org.apache.spark.sql.SparkSession;

       public class CsvToDataframeApp {

         public static void main(String[] args) {
           CsvToDataframeApp app = new CsvToDataframeApp();
           app.start();
         }

         private void start() {
           // Creates a session on a local master
           SparkSession spark = SparkSession.builder()
               .appName("CSV to Dataset")
               .master("local")
               .getOrCreate();

           // Reads a CSV file with header, called books.csv, stores it in a dataframe
           Dataset<Row> df = spark.read().format("csv")
               .option("header", "true")
               .load("data/books.csv");

           // Shows at most 5 rows from the dataframe
           df.show(5);
         }
       }
       /jgperrin/net.jgp.books.sparkWithJava.ch01
  • 25. So what happened? Let’s try to understand a little more
  • 27. [Diagram: your application sits on a unified API (Spark SQL, Spark streaming, machine learning & deep learning & artificial intelligence, GraphX), which runs on nodes 1 to 8, each with its own OS and hardware.]
  • 28. [Diagram: your application works with a dataframe through the unified API (Spark SQL, Spark streaming, machine learning & deep learning & artificial intelligence, GraphX), on top of the OS of nodes 1 to 8.]
  • 29. [Diagram: the dataframe at the center of Spark SQL, Spark streaming, machine learning & deep learning & artificial intelligence, and GraphX.]
  • 30. Lab #2 - a bit of analytics But really just a bit
  • 31. Lab #2 - a little bit of analytics ๏ Goal: From two datasets, one containing books, the other authors, list the authors with their number of books, most books first ๏ Source code: https://github.com/jgperrin/net.jgp.labs.spark
  • 32. If it was in a relational database
       authors.csv: id: integer, name: string, link: string, wikipedia: string
       books.csv: id: integer, authorId: integer, title: string, releaseDate: string, link: string
  • 33. Basic analytics ๏ Go to net.jgp.labs.spark.l200_join.l030_count_books ๏ Open AuthorsAndBooksCountBooksApp.java ๏ Right click, Run As, Java Application
  • 34.
       +---+-------------------+--------------------+-----+
       | id|               name|                link|count|
       +---+-------------------+--------------------+-----+
       |  1|      J. K. Rowling|http://amzn.to/2l...|    4|
       | 12|William Shakespeare|http://amzn.to/2j...|    3|
       |  4|      Denis Diderot|http://amzn.to/2i...|    2|
       |  6|        Craig Walls|http://amzn.to/2A...|    2|
       |  2|Jean Georges Perrin|http://amzn.to/2w...|    2|
       |  3|         Mark Twain|http://amzn.to/2v...|    2|
       | 11|       Alan Mycroft|http://amzn.to/2A...|    1|
       | 10|        Mario Fusco|http://amzn.to/2A...|    1|
       …
       +---+-------------------+--------------------+-----+

       root
        |-- id: integer (nullable = true)
        |-- name: string (nullable = true)
        |-- link: string (nullable = true)
        |-- count: long (nullable = false)
  • 35.
       package net.jgp.labs.spark.l200_join.l030_count_books;

       import org.apache.spark.sql.Dataset;
       import org.apache.spark.sql.Row;
       import org.apache.spark.sql.SparkSession;

       public class AuthorsAndBooksCountBooksApp {

         public static void main(String[] args) {
           AuthorsAndBooksCountBooksApp app = new AuthorsAndBooksCountBooksApp();
           app.start();
         }

         private void start() {
           SparkSession spark = SparkSession.builder()
               .appName("Authors and Books")
               .master("local").getOrCreate();

           String filename = "data/authors.csv";
           Dataset<Row> authorsDf = spark.read()
               .format("csv")
               .option("inferSchema", "true")
               .option("header", "true")
               .load(filename);
       /jgperrin/net.jgp.labs.spark
  • 36.
           filename = "data/books.csv";
           Dataset<Row> booksDf = spark.read()
               .format("csv")
               .option("inferSchema", "true")
               .option("header", "true")
               .load(filename);

           Dataset<Row> libraryDf = authorsDf
               .join(
                   booksDf,
                   authorsDf.col("id").equalTo(booksDf.col("authorId")),
                   "left")
               .withColumn("bookId", booksDf.col("id"))
               .drop(booksDf.col("id"))
               .groupBy(
                   authorsDf.col("id"),
                   authorsDf.col("name"),
                   authorsDf.col("link"))
               .count();

           libraryDf = libraryDf
               .orderBy(libraryDf.col("count").desc());

           libraryDf.show();
           libraryDf.printSchema();
         }
       }
       /jgperrin/net.jgp.labs.spark
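To see what the join, groupBy, and count chain on slide 36 computes, here is a minimal plain-Java sketch that does not use Spark at all. The class name, method name, and sample data are assumptions made for illustration; only the logic (left join on authorId, group by author, count rows, order by count descending) mirrors the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the lab repository.
final class LeftJoinCountSketch {

  /**
   * authors maps author id to name; each book is {bookId, authorId}.
   * Returns author id -> number of joined rows, highest count first.
   */
  static Map<Integer, Long> countJoinedRows(Map<Integer, String> authors, int[][] books) {
    // Left join: one row per (author, matching book). An author without
    // any matching book still yields one row, so its row count is 1.
    List<Integer> joinedAuthorIds = new ArrayList<>();
    for (Integer authorId : authors.keySet()) {
      boolean matched = false;
      for (int[] book : books) {
        if (book[1] == authorId) {
          joinedAuthorIds.add(authorId);
          matched = true;
        }
      }
      if (!matched) {
        joinedAuthorIds.add(authorId);
      }
    }

    // Group by author id and count the rows in each group.
    Map<Integer, Long> counts = joinedAuthorIds.stream()
        .collect(Collectors.groupingBy(id -> id, Collectors.counting()));

    // Order by count, descending, and keep that order in the result.
    return counts.entrySet().stream()
        .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
        .collect(Collectors.toMap(
            Map.Entry::getKey, Map.Entry::getValue, (a, b) -> a, LinkedHashMap::new));
  }

  public static void main(String[] args) {
    Map<Integer, String> authors = new LinkedHashMap<>();
    authors.put(1, "J. K. Rowling");
    authors.put(2, "Jean Georges Perrin");
    int[][] books = {{1, 1}, {2, 1}, {5, 2}};
    countJoinedRows(authors, books)
        .forEach((id, count) -> System.out.println(authors.get(id) + ": " + count));
  }
}
```

Spark does the same work, but on dataframes that can be spread over several nodes.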
  • 37. The art of delegating
  • 38. [Diagram: your app is the driver. It talks to the master (cluster manager), which delegates to the slaves (workers). Each worker runs an executor, and each executor runs tasks.]
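As a loose analogy only, the delegation pattern on this slide can be imitated on a single machine with a Java thread pool: a driver splits the data into partitions, hands one task per partition to a pool of workers, and combines the partial results. This is not how Spark is implemented; the class and method names below are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical single-machine analogy of driver, executors, and tasks.
final class DelegationSketch {

  /** The "driver": submits one counting task per partition and sums the results. */
  static long countRows(List<List<String>> partitions, int workerThreads) {
    ExecutorService executor = Executors.newFixedThreadPool(workerThreads);
    try {
      List<Future<Long>> pending = new ArrayList<>();
      for (List<String> partition : partitions) {
        // One task per partition; it only sees its own slice of the data.
        Callable<Long> task = () -> (long) partition.size();
        pending.add(executor.submit(task));
      }
      long total = 0;
      for (Future<Long> result : pending) {
        total += result.get(); // wait for each task and combine the results
      }
      return total;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for tasks", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("A task failed", e);
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    List<List<String>> partitions = Arrays.asList(
        Arrays.asList("book 1", "book 2"),
        Arrays.asList("book 3"));
    System.out.println("Rows: " + countRows(partitions, 2));
  }
}
```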
  • 39. Lab #3 - an even smaller bit of AI But really just a bit
  • 41. General AI (popular beliefs) ๏ Robot with human-like behavior ๏ HAL from 2001 ๏ Isaac Asimov ๏ Potential ethical problems. Narrow AI (current state-of-the-art) ๏ Lots of mathematics ๏ Heavy calculations ๏ Algorithms ๏ Self-driving cars
  • 42. I am an expert in general AI. ARTIFICIAL INTELLIGENCE is Machine Learning
  • 43. Machine learning
       ๏ Common algorithms
           ๏ Linear and logistic regressions
           ๏ Classification and regression trees
           ๏ K-nearest neighbors (KNN)
       ๏ Deep learning
           ๏ Subset of ML
           ๏ Artificial neural networks (ANNs)
           ๏ Super CPU intensive, use of GPU
  • 44. There are two kinds of data scientists: 1) Those who can extrapolate from incomplete data.
  • 45. Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
       DATA Engineer: Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps. Match architecture with business needs. Develop processes for data modeling, mining, and pipelines. Improve data reliability and quality.
       DATA Scientist: Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations. Prepare data for predictive models. Explore data to find hidden gems and patterns. Tell stories to key stakeholders.
  • 46. Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer DATA Engineer, DATA Scientist: SQL
  • 47. All over again. As the old adage goes: Garbage In, Garbage Out (xkcd)
  • 48. Lab #3 - correcting and extrapolating data
  • 49. Lab #3 - projecting data ๏ Goal: As a restaurant manager, I want to predict how much revenue a party of 40 will bring ๏ Source code: https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
  • 50. If everything was as simple… Dinner revenue per number of guests
  • 51. …as a visual representation Anomaly #1 Anomaly #2
  • 52. I love it when a plan comes together
  • 53. Load & Format
       +-----+-----+
       |guest|price|
       +-----+-----+
       |    1| 23.1|
       |    2| 30.0|
       …
       +-----+-----+
       only showing top 20 rows

       ---- 1st DQ rule
       +-----+-----+------------+
       |guest|price|price_no_min|
       +-----+-----+------------+
       |    1| 23.1|        23.1|
       |    2| 30.0|        30.0|
       …
       |   25|  3.0|        -1.0|
       |   26| 10.0|        -1.0|
       …
       +-----+-----+------------+
       …
       +-----+-----+-----+--------+
       |guest|price|label|features|
       +-----+-----+-----+--------+
       |    1| 23.1| 23.1|   [1.0]|
       |    2| 30.0| 30.0|   [2.0]|
       …
       +-----+-----+-----+--------+
       only showing top 20 rows
       …
       RMSE: 2.802192495300457
       r2: 0.9965340953376102
       Intersection: 20.979190460591575
       Regression parameter: 1.0 Tol: 1.0E-6
       Prediction for 40.0 guests is 218.00351106373822
  • 54. Using existing data quality rules
       package net.jgp.labs.sparkdq4ml.dq.udf;

       import org.apache.spark.sql.api.java.UDF1;
       import net.jgp.labs.sparkdq4ml.dq.service.*;

       public class MinimumPriceDataQualityUdf implements UDF1<Double, Double> {
         // Delegates to the existing data quality service
         @Override
         public Double call(Double price) throws Exception {
           return MinimumPriceDataQualityService.checkMinimumPrice(price);
         }
       }
       /jgperrin/net.jgp.labs.sparkdq4ml
       If the price is OK, returns the price; if the price is not OK, returns -1
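The slide does not show the body of MinimumPriceDataQualityService.checkMinimumPrice, only its contract: return the price when it is acceptable, -1 otherwise. A minimal plain-Java sketch of such a rule could look like the following. The class name and the threshold of 20.0 are assumptions; the threshold was picked only because the sample output flags 15.0 and accepts 23.1.

```java
// Hypothetical stand-in for the data quality service used by the UDF.
final class MinimumPriceRuleSketch {

  // Assumed threshold, not taken from the lab's source code.
  static final double MIN_PRICE = 20.0;

  /** Returns the price if it is at least MIN_PRICE, -1 otherwise. */
  static double checkMinimumPrice(double price) {
    return price < MIN_PRICE ? -1.0 : price;
  }

  public static void main(String[] args) {
    double[] prices = {23.1, 30.0, 3.0, 10.0, 158.0};
    for (double price : prices) {
      System.out.println(price + " -> " + checkMinimumPrice(price));
    }
  }
}
```

Keeping the rule in a plain service class, and wrapping it in a thin UDF, means the same rule can be reused outside Spark.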
  • 55. Telling Spark to use my DQ rules
       SparkSession spark = SparkSession.builder()
           .appName("DQ4ML").master("local").getOrCreate();
       spark.udf().register(
           "minimumPriceRule",
           new MinimumPriceDataQualityUdf(),
           DataTypes.DoubleType);
       spark.udf().register(
           "priceCorrelationRule",
           new PriceCorrelationDataQualityUdf(),
           DataTypes.DoubleType);
       /jgperrin/net.jgp.labs.sparkdq4ml
  • 56. Loading my dataset
       String filename = "data/dataset.csv";
       Dataset<Row> df = spark.read().format("csv")
           .option("inferSchema", "true").option("header", "false")
           .load(filename);
       df = df.withColumn("guest", df.col("_c0")).drop("_c0");
       df = df.withColumn("price", df.col("_c1")).drop("_c1");
       df = df.withColumn(
           "price_no_min",
           callUDF("minimumPriceRule", df.col("price")));
       df.createOrReplaceTempView("price");
       df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");
       /jgperrin/net.jgp.labs.sparkdq4ml
       Using CSV, but it could be Hive, JDBC, you name it…
  • 57.
       +-----+-----+
       |guest|price|
       +-----+-----+
       |    1|23.24|
       |    2|30.89|
       |    2|33.74|
       |    3|34.89|
       |    3|29.91|
       |    3| 38.0|
       |    4| 40.0|
       |    5|120.0|
       |    6| 50.0|
       |    6|112.0|
       |    8| 60.0|
       |    8|127.0|
       |    8|120.0|
       |    9|130.0|
       +-----+-----+
       Raw data, contains the anomalies
  • 58. Apply the rules
       String filename = "data/dataset.csv";
       Dataset<Row> df = spark.read().format("csv")
           .option("inferSchema", "true").option("header", "false")
           .load(filename);
       df = df.withColumn("guest", df.col("_c0")).drop("_c0");
       df = df.withColumn("price", df.col("_c1")).drop("_c1");
       df = df.withColumn(
           "price_no_min",
           callUDF("minimumPriceRule", df.col("price")));
       df.createOrReplaceTempView("price");
       df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");
       /jgperrin/net.jgp.labs.sparkdq4ml
  • 59.
       +-----+-----+------------+
       |guest|price|price_no_min|
       +-----+-----+------------+
       |    1| 23.1|        23.1|
       |    2| 30.0|        30.0|
       |    2| 33.0|        33.0|
       |    3| 34.0|        34.0|
       |   24|142.0|       142.0|
       |   24|138.0|       138.0|
       |   25|  3.0|        -1.0|
       |   26| 10.0|        -1.0|
       |   25| 15.0|        -1.0|
       |   26|  4.0|        -1.0|
       |   28| 10.0|        -1.0|
       |   28|158.0|       158.0|
       |   30|170.0|       170.0|
       |   31|180.0|       180.0|
       +-----+-----+------------+
       Anomalies are clearly identified by -1, so they can be easily filtered
  • 60. Filtering out anomalies
       String filename = "data/dataset.csv";
       Dataset<Row> df = spark.read().format("csv")
           .option("inferSchema", "true").option("header", "false")
           .load(filename);
       df = df.withColumn("guest", df.col("_c0")).drop("_c0");
       df = df.withColumn("price", df.col("_c1")).drop("_c1");
       df = df.withColumn(
           "price_no_min",
           callUDF("minimumPriceRule", df.col("price")));
       df.createOrReplaceTempView("price");
       df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");
       /jgperrin/net.jgp.labs.sparkdq4ml
  • 61.
       +-----+-----+
       |guest|price|
       +-----+-----+
       |    1| 23.1|
       |    2| 30.0|
       |    2| 33.0|
       |    3| 34.0|
       |    3| 30.0|
       |    4| 40.0|
       |   19|110.0|
       |   20|120.0|
       |   22|131.0|
       |   24|142.0|
       |   24|138.0|
       |   28|158.0|
       |   30|170.0|
       |   31|180.0|
       +-----+-----+
       Useable data
  • 62. Format the data for ML
       ๏ Convert/Adapt dataset to Features and Label
       ๏ Required for Linear Regression in MLlib
           ๏ Needs a column called label of type double
           ๏ Needs a column called features of type VectorUDT
  • 63. Format the data for ML
       spark.udf().register(
           "vectorBuilder",
           new VectorBuilder(),
           new VectorUDT());
       df = df.withColumn("label", df.col("price"));
       df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));

       // ... Lots of complex ML code goes here ...

       double p = model.predict(features);
       System.out.println("Prediction for " + feature + " guests is " + p);
       /jgperrin/net.jgp.labs.sparkdq4ml
  • 64.
       +-----+-----+-----+--------+------------------+
       |guest|price|label|features|        prediction|
       +-----+-----+-----+--------+------------------+
       |    1| 23.1| 23.1|   [1.0]|24.563807596513133|
       |    2| 30.0| 30.0|   [2.0]|29.595283312577884|
       |    2| 33.0| 33.0|   [2.0]|29.595283312577884|
       |    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
       |    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
       |    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
       |    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
       |   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
       |   16|102.0|102.0|  [16.0]|100.03594333748444|
       |   20|120.0|120.0|  [20.0]|120.16184620174346|
       |   22|131.0|131.0|  [22.0]|130.22479763387295|
       |   24|142.0|142.0|  [24.0]|140.28774906600245|
       +-----+-----+-----+--------+------------------+
       Prediction for 40.0 guests is 220.79136052303852
       Prediction for 40 guests
  • 65. (the complex ML code)
       // Define the algorithm and its (hyper)parameters
       LinearRegression lr = new LinearRegression()
           .setMaxIter(40)
           .setRegParam(1)
           .setElasticNetParam(1);
       // Create a model from our data
       LinearRegressionModel model = lr.fit(df);
       // Apply the model to new data: predict
       Double feature = 40.0;
       Vector features = Vectors.dense(40.0);
       double p = model.predict(features);
       /jgperrin/net.jgp.labs.sparkdq4ml
  • 66. It’s all about the base model. Step 1, learning phase: Dataset #1 → Trainer → Model. Step 2..n, predictive phase: Dataset #2 → Model (same model) → Predicted Data
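The two phases, learn once and then predict as often as needed, can be sketched without Spark using ordinary least squares on a single feature. This is a simplified stand-in and not the MLlib algorithm used in the lab: it has no regularization or elastic net, so its numbers will not match the slide output. The class name and sample data (a subset of the cleaned guest/price rows shown earlier) are chosen for illustration.

```java
// Hypothetical plain-Java illustration of "fit a model, then predict".
final class GuestRevenueRegressionSketch {

  final double slope;
  final double intercept;

  private GuestRevenueRegressionSketch(double slope, double intercept) {
    this.slope = slope;
    this.intercept = intercept;
  }

  /** Learning phase: ordinary least squares fit of price = intercept + slope * guests. */
  static GuestRevenueRegressionSketch fit(double[] guests, double[] prices) {
    if (guests.length != prices.length || guests.length < 2) {
      throw new IllegalArgumentException("Need at least two (guest, price) pairs");
    }
    int n = guests.length;
    double meanX = 0;
    double meanY = 0;
    for (int i = 0; i < n; i++) {
      meanX += guests[i];
      meanY += prices[i];
    }
    meanX /= n;
    meanY /= n;

    double sxy = 0;
    double sxx = 0;
    for (int i = 0; i < n; i++) {
      double dx = guests[i] - meanX;
      sxy += dx * (prices[i] - meanY);
      sxx += dx * dx;
    }
    if (sxx == 0) {
      throw new IllegalArgumentException("All guest counts are identical");
    }
    double slope = sxy / sxx;
    return new GuestRevenueRegressionSketch(slope, meanY - slope * meanX);
  }

  /** Predictive phase: apply the same model to a new value. */
  double predict(double guests) {
    return intercept + slope * guests;
  }

  public static void main(String[] args) {
    double[] guests = {1, 2, 2, 3, 3, 4, 19, 20, 22, 24, 24, 28, 30, 31};
    double[] prices = {23.1, 30, 33, 34, 30, 40, 110, 120, 131, 142, 138, 158, 170, 180};
    GuestRevenueRegressionSketch model = fit(guests, prices);
    System.out.println("Prediction for 40 guests is " + model.predict(40));
  }
}
```

The model object is built once and can then be reused for any number of predictions, which is the point of separating step 1 from steps 2..n.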
  • 68. A (Big) Data Scenario: Data → Ingestion → Raw Data → Data Quality → Pure Data → Transformation → Rich Data → Load/Publish → Data
  • 69. Key takeaways ๏ Big Data is easier than one could think ๏ Java is the way to go (or Python) ๏ New vocabulary for using Spark ๏ You have a friend to help (ok, me) ๏ Spark is fun ๏ Spark is easily extensible
  • 70. Going further ๏ Contact me @jgperrin ๏ Join the Spark User mailing list ๏ Get help from Stack Overflow ๏ fb.com/TriangleSpark
  • 71. Going further: Spark in Action (Second edition, MEAP) by Jean Georges Perrin, published by Manning. http://jgp.net/sia. Codes: sparkjava-65CE, ctwato18. One free book, 40% off
  • 74. Spark in Action Second edition, MEAP by Jean Georges Perrin published by Manning http://jgp.net/sia
  • 75. Credits: Photos by Pexels. IBM PC XT by Ruben de Rijcke - http://dendmedia.com/vintage/ - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=3610862. Illustrations © Jean Georges Perrin
  • 76. No more slides You’re on your own!