Snehal Nagmote (Staff Software Engineer)
Yash Kotresh (Data Scientist)
Apache Spark And Machine Learning
@ Women Who Code Silicon Valley Meetup
Who Am I ?
 Snehal Nagmote
 Staff Software Engineer@WalmartLabs
 Linkedin
 https://www.linkedin.com/in/snehal-nagmote-79651122
 Blog
 https://technografiti.wordpress.com/
Objectives
 What is Apache Spark
 Spark Basics
 Spark Runtime Architecture
 Working with RDDs
 Transformation and Actions
 Spark DataFrames
 Demo
What is Big Data ?
Sources of Big Data?
• User generated content
• Facebook, Instagram, Yelp, Twitter
• TripAdvisor
• Application Log Files – Apache Web Server Logs
• Online / streaming
• Internet of Things (IoT)
We want to be able to handle the data, query it, build models, make
predictions, etc.
Apache Spark is a fast and expressive cluster computing system
compatible with Apache Hadoop.
• Handles batch, interactive, and real-time within a single
framework
• Native integration with Java, Python, Scala
• Programming at a higher level of abstraction
More general than MapReduce: map/reduce is just one set of supported
constructs
It improves computation performance by means of:
• In-memory computing primitives
• General computation graphs
What is Apache Spark ?
MapReduce to Spark
MapReduce
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Spark
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark Growth : Users
Spark Unified Stack
Let’s start with the basic concepts in the Scala and Python shells,
launched respectively with:
./bin/spark-shell
./bin/pyspark
alternatively, with IPython Notebook:
IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
Spark Concepts: Shell
• The first thing a Spark program does is create a
SparkContext object, which tells Spark how to access a
cluster
• In the shell, for either Scala or Python, this is the sc
variable, which is created automatically
• Other programs must use a constructor to
instantiate a new SparkContext (see the sketch after this slide)
• The SparkContext is then used to create RDDs and other
variables
Spark Concepts: SparkContext
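The shell creates sc automatically; a standalone program builds its own SparkContext. A minimal PySpark sketch, where the app name, master, and file path are illustrative choices rather than anything from the slides:

from pyspark import SparkConf, SparkContext

# Illustrative configuration: run locally with two worker threads
conf = SparkConf().setAppName("WordCountApp").setMaster("local[2]")
sc = SparkContext(conf=conf)

counts = (sc.textFile("README.md")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(5))

sc.stop()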
Spark Concepts: SparkContext
Scala:
scala> sc
res: spark.SparkContext = spark.SparkContext@470d1f30
Python:
>>> sc
<pyspark.context.SparkContext object at 0x7f7570783350>
The master parameter for a SparkContext determines which cluster to use
Spark Concepts: Master
master – description
local – run Spark locally with one worker thread (no parallelism)
local[K] – run Spark locally with K worker threads (ideally set to # cores)
spark://HOST:PORT – connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT – connect to a Mesos cluster; PORT depends on config (5050 by default)
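The master value can be passed directly to the SparkContext constructor; a one-line PySpark sketch where the host and app name are illustrative:

sc = SparkContext("spark://HOST:7077", "MyApp")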
1. Driver Program connects to a cluster manager, which
allocates resources across applications
2. Acquires executors on cluster nodes – worker processes to
run computations and store data
3. Sends app code to the executors
4. Sends tasks for the executors to run
5. Output is collected at Driver
Spark Concepts: Job Execution
[Diagram: the Driver Program (holding the SparkContext) talks to the Cluster Manager, which assigns Worker Nodes; each Worker Node runs an Executor with a cache and its tasks]
Spark Concepts - Job Execution
The main goal of Spark is to provide the user an API to work with distributed
collections of data as if they were local. These collections are called RDDs
(Resilient Distributed Datasets).
• Primary abstraction in Spark, and immutable once they're
constructed
• Two types of operations on RDDs: transformations and actions
• Transformations are lazy (not computed immediately)
• The transformed RDD gets recomputed when an action is run on it (default)
• However, an RDD can be persisted into storage, in memory or on disk
Spark Concepts: Resilient Distributed
DataSets (RDD)
Spark Concepts: RDD Internals
Two Ways to create an RDD
• Parallelized collections – take an existing Scala/Python
collection and run functions on it in parallel
• Hadoop datasets – run functions on each record of a file
in Hadoop distributed file system or any other storage
system supported by Hadoop
Spark Concepts: RDD
Spark Concepts: RDD
Python:
>>> data = [1, 2, 3, 4, 5]
>>> data
[1, 2, 3, 4, 5]
>>> distData = sc.parallelize(data)
>>> distData
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229
Scala:
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
Spark Concepts: RDD
Scala:
scala> val distFile = sc.textFile("README.md")
distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
Python:
>>> distFile = sc.textFile("README.md")
14/04/19 23:42:40 INFO storage.MemoryStore: ensureFreeSpace(36827) with curMem=0, maxMem=318111744
14/04/19 23:42:40 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 36.0 KB, free 303.3 MB)
>>> distFile
MappedRDD[2] at textFile at NativeMethodAccessorImpl.java:-2
The operations that can be applied on RDDs have two main
types:
Transformations
These are lazy operations that create new RDDs based on other
RDDs. Example:
› map, filter, union, distinct, etc.
Actions
These are the operations that actually do some computation and get the
results or write them to disk. Example:
› count, collect, first
Spark Concepts: Transformations vs Actions
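A quick sketch of the laziness described above (the file name is illustrative): the map transformation returns immediately, and nothing is read or computed until the count action runs.

lines = sc.textFile("README.md")        # transformation: nothing is read yet
lengths = lines.map(lambda l: len(l))   # transformation: still lazy
total = lengths.count()                 # action: the whole pipeline executes now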
Spark Essentials: Transformations
transformation – description
map(func) – return a new distributed dataset formed by passing each element of the source through a function func
filter(func) – return a new dataset formed by selecting those elements of the source on which func returns true
flatMap(func) – similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
sample(withReplacement, fraction, seed) – sample a fraction fraction of the data, with or without replacement, using a given random number generator seed
union(otherDataset) – return a new dataset that contains the union of the elements in the source dataset and the argument
distinct([numTasks]) – return a new dataset that contains the distinct elements of the source dataset
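A small PySpark sketch exercising a few of these transformations; the input values are invented:

nums = sc.parallelize([1, 2, 2, 3, 4, 5])
evens = nums.filter(lambda x: x % 2 == 0)       # [2, 2, 4]
doubled = nums.map(lambda x: x * 2)             # [2, 4, 4, 6, 8, 10]
expanded = nums.flatMap(lambda x: [x, x * 10])  # [1, 10, 2, 20, ...]
unique = nums.distinct()                        # [1, 2, 3, 4, 5]
print(unique.collect())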
Spark Essentials: Transformations
transformation – description
groupByKey([numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs
reduceByKey(func, [numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function
sortByKey([ascending], [numTasks]) – when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument
join(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key
cogroup(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples; also called groupWith
cartesian(otherDataset) – when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)
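A short sketch of the key-based transformations above; the pairs are invented for illustration:

sales = sc.parallelize([("apple", 3), ("pear", 2), ("apple", 5)])
prices = sc.parallelize([("apple", 0.5), ("pear", 0.8)])

totals = sales.reduceByKey(lambda a, b: a + b)  # [("apple", 8), ("pear", 2)]
grouped = sales.groupByKey()                    # key -> iterable of its values
joined = totals.join(prices)                    # [("apple", (8, 0.5)), ("pear", (2, 0.8))]
print(sorted(joined.collect()))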
val distFile = sc.textFile("README.md")
val wordsRDD = distFile.map(l =>
l.split(" "))
wordsRDD.count()
Spark Essentials: Transformations
Scala:
Python:
distFile = sc.textFile("README.md")
# Split each line into a list of words
wordsRDD = distFile.map(lambda x: x.split(' '))
wordsRDD.count()
distFile is a collection of lines
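Note that map keeps one output element per line (a list of words), so count() above still counts lines; flatMap, sketched below, flattens those lists so that count() would count individual words.

wordsFlat = distFile.flatMap(lambda x: x.split(' '))  # one element per word
wordsFlat.count()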
Spark Essentials: Transformations
Java 7:
JavaRDD<String> distFile = sc.textFile("README.md");
// Map each line to multiple words
JavaRDD<String> words = distFile.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      return Arrays.asList(line.split(" "));
    }
  });
Java 8:
JavaRDD<String> distFile = sc.textFile("README.md");
JavaRDD<String> words =
  distFile.flatMap(line -> Arrays.asList(line.split(" ")));
Spark Essentials: Actions
action – description
reduce(func) – aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should be commutative and associative so that it can be computed correctly in parallel
collect() – return all the elements of the dataset as an array at the driver program; usually useful after a filter or other operation that returns a sufficiently small subset of the data
count() – return the number of elements in the dataset
first() – return the first element of the dataset; similar to take(1)
take(n) – return an array with the first n elements of the dataset; currently not executed in parallel, instead the driver program computes all the elements
takeSample(withReplacement, num, seed) – return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed
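A quick sketch of a few of these actions on an invented dataset:

nums = sc.parallelize([5, 3, 8, 1])
nums.reduce(lambda a, b: a + b)  # 17
nums.count()                     # 4
nums.first()                     # 5
nums.take(2)                     # [5, 3]
nums.takeSample(False, 2)        # two random elements, without replacement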
Spark Essentials: Actions
action – description
saveAsTextFile(path) – write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file
saveAsSequenceFile(path) – write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. Only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.)
countByKey() – only available on RDDs of type (K, V). Returns a Map of (K, Int) pairs with the count of each key
foreach(func) – run a function func on each element of the dataset; usually done for side effects such as updating an accumulator variable or interacting with external storage systems
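For example, with an invented pair RDD and an illustrative output path:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.countByKey()                 # {'a': 2, 'b': 1}
pairs.saveAsTextFile("out/pairs")  # writes a directory of part-* text files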
Spark Essentials: Actions
Scala:
val f = sc.textFile("README.md")
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.reduceByKey(_ + _).collect.foreach(println)
Python:
from operator import add
f = sc.textFile("README.md")
# Transformations
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
# Action
w.reduceByKey(add).collect()
Filtering and counting a big log:
>>> rdd_log=sc.textFile( 'nginx_access.log')
>>> rdd_log.filter(lambda l: 'x.html' in l).count()
238
Collecting the interesting lines:
>>> rdd_log=sc.textFile( 'nginx_access.log')
>>> lines=rdd_log.filter( lambda l: 'x.html' in l).collect()
>>> lines
['201.140.8.128 [19/Jun/2012:09:17:31 +0100] 
"GET /x.html HTTP/1.1"', (...)]
Example: Transformations And Actions
• Spark can persist (or cache) a dataset in memory
across operations
• Each node stores in memory any slices of it that it
computes and reuses them in other actions on that dataset
– often making future actions more than 10x faster
• The cache is fault-tolerant: if any partition of an RDD is
lost, it will automatically be recomputed using the
transformations that originally created it
Spark Essentials: Persistence
Spark Essentials: Persistence
storage level – description
MEMORY_ONLY – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER – Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER – Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY – Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. – Same as the levels above, but replicate each partition on two cluster nodes.
Spark Essentials: Persistence
Scala:
val f = sc.textFile("README.md")
val w = f.flatMap(l => l.split(" ")).map(word => (word,
1)).cache()
w.reduceByKey(_ + _).collect.foreach(println)
Python:
from operator import add
f = sc.textFile("README.md")
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x,
1)).cache()
w.reduceByKey(add).collect()
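The cache() calls above are shorthand for persist() at the default MEMORY_ONLY level; other levels from the table are selected explicitly, e.g. in PySpark (file name illustrative):

from pyspark import StorageLevel

f = sc.textFile("README.md")
f.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk rather than recompute
f.count()  # the first action materializes and caches the RDD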
• Distributed collection of data grouped into named columns
(i.e. RDD with schema)
• Support for a wide array of data formats and storage
systems
• Domain-specific functions designed for common tasks
– Metadata
– Sampling
– Project, filter, aggregation, join, …
– UDFs
• Available in Python, Scala, Java, and R (via SparkR)
Spark Concepts : DataFrames
Example: How to use a DataFrame?
# Constructs a DataFrame from the users table in Hive.
users = context.table("users")
# from JSON files in S3
logs = context.load("s3n://path/to/data.json", "json")
# Create a new DataFrame that contains "young users" only
young = users.filter(users.age < 21)
# Alternatively, using Pandas-like syntax
young = users[users.age < 21]
# Increment everybody's age by 1
young.select(young.name, young.age + 1)
# Count the number of young users by gender
young.groupBy("gender").count()
# Join young users with another DataFrame called logs
young.join(logs, logs.userId == users.userId, "left_outer")
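A DataFrame can also be built directly from an RDD by supplying column names; a minimal sketch assuming a SQLContext named context as above, with invented rows and columns:

rows = sc.parallelize([("Alice", 25), ("Bob", 19)])
people = context.createDataFrame(rows, ["name", "age"])
people.filter(people.age < 21).show()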
Data Sources supported by DataFrames
built-in and external sources: JDBC, { JSON }, and more …
[Bar chart: runtime performance of aggregating 10 million int pairs (secs), comparing Spark Python DF, Spark Scala DF, RDD Python, and RDD Scala – the DataFrame API outperforms the raw RDD API in both languages]
DataFrames Performance Comparison
What did we Cover ?
 What is Apache Spark
 Spark Basics & Execution Model
 Working with RDDs
 Transformation and Actions
 Spark DataFrames
 Demo
Spark and Machine Learning
ML (Supervised) Quick Recap
[Diagram: training data is turned into feature vectors; a loss function is optimized to produce a learning function; new data, also as feature vectors, is fed to that function to make predictions]
ML Tools available
• Yet another Machine Learning Library (True!!)
• It is a Spark subproject providing machine learning primitives
• Developed at AMPLab, UC Berkeley
• Built into Spark from version 0.8 onwards
• It is built on Apache Spark, a fast and general engine for
large-scale data processing
• Write applications quickly in Java, Scala, or Python
MLlib: What?
MLlib: Algorithms (Supervised)
• Classification
• Logistic regression
• Linear support vector machine (SVM)
• Naive Bayes
• Regression
• Generalized linear regression (GLM)
MLlib: Algorithms (Logistic Regression)
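A minimal sketch of logistic regression with the RDD-based MLlib API of that era; the tiny dataset below is invented purely for illustration:

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

# Invented training data: label followed by a feature vector
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(1.0, [2.0, 1.0]),
])
model = LogisticRegressionWithSGD.train(data, iterations=100)
print(model.predict([1.5, 0.5]))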
• Collaborative filtering
• Alternating least squares (ALS)
• Clustering
• K-means
• Decomposition:
• Singular value decomposition (SVD)
• Principal component analysis (PCA)
MLlib: Algorithms (Unsupervised)
MLlib: Algorithms (Collaborative Filtering)
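A short sketch of MLlib's ALS-based collaborative filtering; the ratings below are invented:

from pyspark.mllib.recommendation import ALS, Rating

ratings = sc.parallelize([
    Rating(user=1, product=10, rating=4.0),
    Rating(user=1, product=20, rating=1.0),
    Rating(user=2, product=10, rating=5.0),
])
model = ALS.train(ratings, rank=5, iterations=10)
print(model.predict(2, 20))  # predicted rating of product 20 for user 2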
• Example from: https://github.com/szilard/benchm-ml
• Data: Training datasets of sizes 10K, 100K, 1M,
10M are generated from the well-known airline
dataset (using years 2005 and 2006)
• A test set of size 100K is generated from the
same (using year 2007)
• Task: predict whether a flight will be delayed by
more than 15 minutes
• Machine trained on: 32 cores, 60 GB RAM
Benchmark
Tool    Training Examples   Time (sec)   RAM (GB)   AUC
Python  10K                 0.2          2          67.6
        100K                2            3          70.6
        1M                  25           12         71.1
        10M                 Crash                   71.1
Spark   10K                 1            1          66.6
        100K                2            1          70.2
        1M                  5            2          70.9
        10M                 35           10         70.9
Benchmark (Logistic Regression*)
*https://github.com/szilard/benchm-ml
Another way to use ML in Spark
• Scikit-learn (or any ML library) + hyperparameter parallelization (see the sketch below)
[Diagram: the data plus the ML library are distributed to the workers; a hyperparameter RDD maps each hyperparameter set (Set 1, Set 2, Set 3) to an independent training run of the ML algorithm on the data]
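A minimal sketch of this pattern, assuming the training data fits in memory on each worker and using an invented dataset and hyperparameter grid:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Invented dataset, broadcast so every worker gets a copy
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)
bc_data = sc.broadcast((X, y))

# One RDD element per hyperparameter set; each worker trains independently
param_grid = [{"C": c} for c in [0.01, 0.1, 1.0, 10.0]]

def evaluate(params):
    X_local, y_local = bc_data.value
    clf = LogisticRegression(**params)
    return params, cross_val_score(clf, X_local, y_local, cv=3).mean()

results = sc.parallelize(param_grid).map(evaluate).collect()
best_params, best_score = max(results, key=lambda r: r[1])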
How do we use Spark at WalmartLabs?
Cold Start problem: Background