Spark Overview
Lisa Hua
7/14/2014
Overview
● What is Spark?
● How does Spark work?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
What is Spark?
● Hadoop/YARN:
○ strong at processing large files in parallel
○ synchronization barrier when persisting data to disk
○ MapReduce: launch mapper & reducer, read/write to disk, go back to the queue and wait for resources
● Spark:
○ in-memory processing
○ iterative and interactive data analysis
○ compared to MapReduce, supports more complex and interactive applications
Hadoop MapReduce
● Slow due to replication, serialization, and disk I/O
● Inefficient for:
○ Iterative algorithms (Machine Learning, Graphs & Network Analysis)
○ Interactive Data Mining (R, Excel, Searching)
Spark In-memory Processing
1. Extract a working set
2. Cache it
3. Query it repeatedly (sketch below)
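In the Java API this pattern looks roughly like the following (a minimal sketch; the path, the filter condition, and the variable names are illustrative, and sc is an existing JavaSparkContext):

JavaRDD<String> lines = sc.textFile("hdfs://.../logs");      // 1. extract a working set
JavaRDD<String> errors = lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("ERROR"); }
});
errors.cache();                                              // 2. cache it
long total = errors.count();   // 3. the first action reads the file and populates the cache
long again = errors.count();   //    repeated queries are then served from memory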
Spark Ecosystem
Overview
● What is Spark?
● How Spark works?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
How Spark Works - SparkContext
How Spark Works - RDD
● Partitions of Data
● Dependencies between partitions
Storage types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, ... (example below)
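Choosing a level through the Java API might look like this (a hedged sketch; rdd is any JavaRDD built earlier, and the constants live in org.apache.spark.api.java.StorageLevels):

rdd.persist(StorageLevels.MEMORY_AND_DISK); // spill partitions that don't fit in memory to disk
// rdd.cache() is shorthand for persisting with MEMORY_ONLY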
How Spark Works - RDD operations
Transformations
● Create a new dataset from an existing one.
● Lazy in nature: executed only when some action is performed (see the sketch below).
● Examples: map(func), filter(func), distinct()
Actions
● Return a value or export data after performing a computation.
● Examples: count(), reduce(func), collect(), take()
Persistence
● Cache a dataset in memory for future operations; store on disk, in RAM, or a mix of the two.
● Examples: persist(), cache()
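A minimal illustration of this laziness in the Java API (names is an assumed JavaRDD<String>; uses org.apache.spark.api.java.function.Function):

JavaRDD<String> upper = names.map(new Function<String, String>() {
  public String call(String s) { return s.toUpperCase(); }  // transformation: recorded, not executed
});
JavaRDD<String> uniq = upper.distinct();  // still lazy: only the lineage grows
long n = uniq.count();                    // action: triggers the actual computation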
How Spark Works: Word Count
How Spark Works - Actions
● Parallel Operations
How Spark Works - Stages
Each stage is executed as a series of tasks (one task for each partition).
The stages together form a DAG (directed acyclic graph).
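The lineage behind this DAG can be inspected on any RDD (a small sketch; counts here is assumed to come from a shuffle operation such as reduceByKey):

System.out.println(counts.toDebugString());  // prints the lineage; shuffle boundaries mark new stages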
Spark Programming - Tasks
A task is the fundamental unit of execution in Spark.
How Spark Works - Summary
● SparkContext
● Resilient Distributed Datasets (RDDs)
● Parallel Operations
● Shared Variables (sketch below)
○ Broadcast Variables - read-only
○ Accumulators
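A minimal sketch of both kinds of shared variables in the Java API (the values are illustrative):

Broadcast<int[]> lookup = sc.broadcast(new int[]{1, 2, 3}); // read-only, shipped once per node
Accumulator<Integer> badRecords = sc.accumulator(0);        // tasks add to it; only the driver reads it
// inside a task:  badRecords.add(1);  lookup.value()[0];
// on the driver, after an action has run:  badRecords.value();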
Compare Hadoop and Spark

                 Traditional OS   Hadoop                     Spark
Storage          File system      HDFS                       HDFS
Schedule         Processes        MapReduce                  Computation graph
I/O              Disk             Disk                       Cache (in memory) and shared data
Fault tolerance  -                Duplication and disk I/O   Hash partitioning and auto-reconstruction
Overview
● What is Spark?
● How does Spark work?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
Spark - LogisticRegressionModel
1. Initialize Spark (JavaSparkContext)
2. Prepare the dataset
3. Train the LR model
4. Evaluate the model
1. Initializing Spark
1. JavaSparkContext: tells Spark how to access the cluster
2. SparkConf: configuration - a map of <String, String>
a. required: AppName and Master; other settings fall back to defaults
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
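For example, to run locally with two worker threads (the app name "LRDemo" and the "local[2]" master value are illustrative):

SparkConf conf = new SparkConf().setAppName("LRDemo").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);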
2. Prepare Dataset
1. From Parallelized Collections
2. From External DataSets
3. Passing Functions to Spark
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
JavaRDD<String> distFile = sc.textFile("data.txt"); // or, from HDFS:
JavaRDD<String> distFile = sc.textFile("hdfs://data.txt");
// completed sketch of the parser; assumes comma-separated lines with the label first, e.g. "1,0.5,2.3"
class ParseLabeledPoint implements Function<String, LabeledPoint> {
  public LabeledPoint call(String s) {
    String[] tokens = s.split(",");
    double y = Double.parseDouble(tokens[0]);
    int len = tokens.length - 1;
    double[] x = new double[len];
    for (int i = 0; i < len; i++) {
      x[i] = Double.parseDouble(tokens[i + 1]);
    }
    return new LabeledPoint(y, Vectors.dense(x));
  }
}
---
JavaRDD<LabeledPoint> data = distFile.map(new ParseLabeledPoint());
3. Train LogisticRegressionModel
Train the model:
/*
 * @param input RDD of (label, array of features) pairs.
 * @param numIterations Number of iterations of gradient descent to run.
 * @param stepSize Step size to be used for each iteration of gradient descent.
 * @param miniBatchFraction Fraction of data to be used per iteration.
 */
LogisticRegressionModel lrModel = LogisticRegressionWithSGD.train(data, iterations,
    stepSize, miniBatchFraction);
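With concrete hyperparameters (the numbers are purely illustrative, not recommendations); note that MLlib's static train method takes a Scala RDD, so from Java the JavaRDD is unwrapped with rdd():

int numIterations = 100;          // gradient descent iterations (assumed value)
double stepSize = 0.1;            // learning rate (assumed value)
double miniBatchFraction = 1.0;   // use the full dataset in each iteration
LogisticRegressionModel lrModel = LogisticRegressionWithSGD.train(
    data.rdd(), numIterations, stepSize, miniBatchFraction);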
4. Calculate Score - Evaluation
pmmlModel = new PMMLSparkLogisticRegressionModel()
.adaptMLModelToPMML(lrModel, partialPmmlModel);
1. Convert LogisticRegressionModel to PMML model
2. Prepare DataSet and calculate score
//use LogisticRegressionModel
JavaRDD<Vector> evalVectors = lines.map(new ParseVector());
List<Double> evalList = lrModel.predict(evalVectors).collect();
//use PMMLEvaluator
RegressionModelEvaluator evaluator = new RegressionModelEvaluator(pmml);
List<Double> evalResult = evaluator.evaluate(evalData);
//compare two evaluator results
for (...) {
Assert.assertEqual(getPMMLEvaluatorResult(i),sparkEvalList.get(i),DELTA);
}
Overview
● What is Spark?
● How does Spark work?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
Why Spark? - Scalability & Performance
1. Leverages the memory of the cluster for in-memory processing
2. Computation graph optimization for parallel execution

Shark: Spark SQL, Hive on Spark
Hive: manages large datasets in distributed storage
Why Spark? - Compatibility
1. Compatible with HDFS, HBase, and any Hadoop storage system
Why Spark? - Easy-to-Use API
1. Expressive APIs in Java, Scala, and Python
2. Supports more parallel operations
Expressive API - MapReduce
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Expressive API - Spark
(The MapReduce word count from the previous slide is repeated here for side-by-side comparison.)
Scala:
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
(The same MapReduce word count is repeated here for comparison with the Java version.)
Java 6, Java 7:
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
Why Spark? - Third-Party Software
● Mahout
○ Say goodbye to MapReduce
○ Support for Apache Spark
■ Mahout-Spark shell: works with Mahout data structures, such as Matrix, etc.
○ Support for H2O being explored
○ Support for Apache Flink possibly in the future
● H2O
○ Sparkling Water - embraces in-memory processing with ML algorithms

        Purpose                              Language                      Storage    Stakeholder
H2O     In-memory ML / predictive analysis   Java/R                        K/V store  Data analysts
Spark   In-memory processing engine          Scala; supports Java/Python   RDD        HDFS users
Why Spark? - Third-Party Software (cont.)
● Pig on Spark - Spork
● Other commercial software
Overview
● What is Spark?
● How does Spark work?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
How to use Spark in Shifu?
1. train: LogisticRegressionTrainer
2. stats & normalize
3. eval: add more evaluation metrics (see the sketch below)
a. precision, recall, F-measure, precision-recall curve - pr(), precisionByThreshold(), recallByThreshold(), ...
b. area under the curve (AUC) - areaUnderPR()
c. receiver operating characteristic (ROC) - areaUnderROC(), roc()
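These methods come from MLlib's BinaryClassificationMetrics. A minimal sketch, assuming scoreAndLabels is a JavaRDD<Tuple2<Object, Object>> of (score, label) pairs produced by the model:

BinaryClassificationMetrics metrics =
    new BinaryClassificationMetrics(scoreAndLabels.rdd());
double auPR = metrics.areaUnderPR();    // area under the precision-recall curve
double auROC = metrics.areaUnderROC();  // area under the ROC curve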
Related Projects
1. Bulk Synchronous Parallel (BSP)
a. parallel computing model based on message passing
b. BSP superstep: local computation, global communication, barrier synchronization
c. graph processing: Pregel, Giraph
d. scientific computing: Hama
e. optimized operator DAGs: Flink
(Chart: runtime in seconds vs. number of nodes.)
Take Away - Big Data Has Moved In-Memory
1. In-memory big data has come of age.
2. Spark leverages the cluster memory for iterative and interactive operations.
3. Spark is compatible with HDFS, HBase, and any Hadoop storage system.
4. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.
5. Spark has an expressive API.
Questions
3. Train LogisticRegressionModel (cont.)
2. Calculate weights:
val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)
val weights =
  if (addIntercept) { ...
  } else { weightsWithIntercept }
3. Gradient descent: optimize()
4. Training error - not accessible from LogisticRegressionModel:
logInfo("Last 10 stochastic losses %s".format(stochasticLoss.takeRight(10)))
14/07/09 14:10:40 INFO optimization.GradientDescent: Last 10 stochastic losses 0.6931471805599468,
0.5255572298404575, ..., 0.3444544005102222, 0.3355921369255156