SlideShare a Scribd company logo
1 of 27
Apache Spark 
In-Memory Data Processing 
September 2014 Meetup 
Organized by Big Data Hyderabad Meetup Group. 
http://www.meetup.com/Big-Data-Hyderabad/ 
Rahul Jain 
@rahuldausa
Agenda 
• Why Spark 
• Introduction 
• Basics 
• Hands-on 
– Installation 
– Examples 
2
Quick Questionnaire 
How many people know/work on Scala ? 
How many people know/work on Python ? 
How many people know/heard/are using Spark ?
Why Spark ? 
• Most of Machine Learning Algorithms are iterative because each iteration 
can improve the results 
• With Disk based approach each iteration’s output is written to disk making 
it slow 
Hadoop execution flow 
Spark execution flow 
http://www.wiziq.com/blog/hype-around-apache-spark/
About Apache Spark 
• Initially started at UC Berkeley in 2009 
• Fast and general purpose cluster computing system 
• 10x (on disk) - 100x (In-Memory) faster 
• Most popular for running Iterative Machine Learning Algorithms. 
• Provides high level APIs in 
• Java 
• Scala 
• Python 
• Integration with Hadoop and its eco-system and can read existing data. 
• http://spark.apache.org/
Spark Stack 
• Spark SQL 
– For SQL and unstructured data 
processing 
• MLib 
– Machine Learning Algorithms 
• GraphX 
– Graph Processing 
• Spark Streaming 
– stream processing of live data 
streams 
http://spark.apache.org
Execution Flow 
http://spark.apache.org/docs/latest/cluster-overview.html
Terminology 
• Application Jar 
– User Program and its dependencies except Hadoop & Spark Jars bundled into a 
Jar file 
• Driver Program 
– The process to start the execution (main() function) 
• Cluster Manager 
– An external service to manage resources on the cluster (standalone manager, 
YARN, Apache Mesos) 
• Deploy Mode 
– cluster : Driver inside the cluster 
– client : Driver outside of Cluster
Terminology (contd.) 
• Worker Node : Node that run the application program in cluster 
• Executor 
– Process launched on a worker node, that runs the Tasks 
– Keep data in memory or disk storage 
• Task : A unit of work that will be sent to executor 
• Job 
– Consists multiple tasks 
– Created based on a Action 
• Stage : Each Job is divided into smaller set of tasks called Stages that is sequential 
and depend on each other 
• SparkContext : 
– represents the connection to a Spark cluster, and can be used to create RDDs, 
accumulators and broadcast variables on that cluster.
Resilient Distributed Dataset (RDD) 
• Resilient Distributed Dataset (RDD) is a basic Abstraction in Spark 
• Immutable, Partitioned collection of elements that can be operated in parallel 
• Basic Operations 
– map 
– filter 
– persist 
• Multiple Implementation 
– PairRDDFunctions : RDD of Key-Value Pairs, groupByKey, Join 
– DoubleRDDFunctions : Operation related to double values 
– SequenceFileRDDFunctions : Operation related to SequenceFiles 
• RDD main characteristics: 
– A list of partitions 
– A function for computing each split 
– A list of dependencies on other RDDs 
– Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned) 
– Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file) 
• Custom RDD can be also implemented (by overriding functions)
Cluster Deployment 
• Standalone Deploy Mode 
– simplest way to deploy Spark on a private cluster 
• Amazon EC2 
– EC2 scripts are available 
– Very quick launching a new cluster 
• Apache Mesos 
• Hadoop YARN
Monitoring
Monitoring – Stages
Monitoring – Stages
Let’s try some examples…
Spark Shell 
./bin/spark-shell --master local[2] 
The --master option specifies the master URL for a distributed cluster, or local to run 
locally with one thread, or local[N] to run locally with N threads. You should start by 
using local for testing. 
./bin/run-example SparkPi 10 
This will run 10 iterations to calculate the value of Pi
Basic operations… 
scala> val textFile = sc.textFile("README.md") 
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 
scala> textFile.count() // Number of items in this RDD 
ees0: Long = 126 
scala> textFile.first() // First item in this RDD 
res1: String = # Apache Spark 
scala> val linesWithSpark = textFile.filter(line => 
line.contains("Spark")) 
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09 
Simplier - Single liner: 
scala> textFile.filter(line => line.contains("Spark")).count() 
// How many lines contain "Spark"? 
res3: Long = 15
Map - Reduce 
scala> textFile.map(line => line.split(" ").size).reduce((a, b) 
=> if (a > b) a else b) 
res4: Long = 15 
scala> import java.lang.Math 
scala> textFile.map(line => line.split(" ").size).reduce((a, b) 
=> Math.max(a, b)) 
res5: Int = 15 
scala> val wordCounts = textFile.flatMap(line => line.split(" 
")).map(word => (word, 1)).reduceByKey((a, b) => a + b) 
wordCounts: spark.RDD[(String, Int)] = 
spark.ShuffledAggregatedRDD@71f027b8 
wordCounts.collect()
With Caching… 
scala> linesWithSpark.cache() 
res7: spark.RDD[String] = spark.FilteredRDD@17e51082 
scala> linesWithSpark.count() 
res8: Long = 15 
scala> linesWithSpark.count() 
res9: Long = 15
With HDFS… 
val lines = spark.textFile(“hdfs://...”) 
val errors = lines.filter(line => line.startsWith(“ERROR”)) 
println(Total errors: + errors.count())
Standalone (Scala) 
/* SimpleApp.scala */ 
import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark.SparkConf 
object SimpleApp { 
def main(args: Array[String]) { 
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your 
system 
val conf = new SparkConf().setAppName("Simple Application") 
.setMaster(“local") 
val sc = new SparkContext(conf) 
val logData = sc.textFile(logFile, 2).cache() 
val numAs = logData.filter(line => line.contains("a")).count() 
val numBs = logData.filter(line => line.contains("b")).count() 
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) 
} 
}
Standalone (Java) 
/* SimpleApp.java */ 
import org.apache.spark.api.java.*; 
import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.function.Function; 
public class SimpleApp { 
public static void main(String[] args) { 
String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system 
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local"); 
JavaSparkContext sc = new JavaSparkContext(conf); 
JavaRDD<String> logData = sc.textFile(logFile).cache(); 
long numAs = logData.filter(new Function<String, Boolean>() { 
public Boolean call(String s) { return s.contains("a"); } 
}).count(); 
long numBs = logData.filter(new Function<String, Boolean>() { 
public Boolean call(String s) { return s.contains("b"); } 
}).count(); 
System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs); 
} 
}
Standalone (Python) 
"""SimpleApp.py""" 
from pyspark import SparkContext 
logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your 
system 
sc = SparkContext("local", "Simple App") 
logData = sc.textFile(logFile).cache() 
numAs = logData.filter(lambda s: 'a' in s).count() 
numBs = logData.filter(lambda s: 'b' in s).count() 
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
Job Submission 
$SPARK_HOME/bin/spark-submit  
--class "SimpleApp"  
--master local[4]  
target/scala-2.10/simple-project_2.10-1.0.jar
Configuration 
val conf = new SparkConf() 
.setMaster("local") 
.setAppName("CountingSheep") 
.set("spark.executor.memory", "1g") 
val sc = new SparkContext(conf)
Questions ? 
26
Thanks! 
@rahuldausa on twitter and slideshare 
http://www.linkedin.com/in/rahuldausa 
27 
Join us @ For Solr, Lucene, Elasticsearch, Machine Learning, IR 
http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/ 
http://www.meetup.com/DataAnalyticsGroup/ 
Join us @ For Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting edge technologies. 
http://www.meetup.com/Big-Data-Hyderabad/

More Related Content

What's hot

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL DatabasesDerek Stainer
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 

What's hot (20)

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Spark
SparkSpark
Spark
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Spark
SparkSpark
Spark
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 

Viewers also liked

Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 

Viewers also liked (7)

Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 

Similar to Introduction to Apache Spark

Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Spark Programming
Spark ProgrammingSpark Programming
Spark ProgrammingTaewook Eom
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Sharktrihug
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizationsGal Marder
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to SparkKyle Burke
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxAishg4
 
A Deep Dive Into Spark
A Deep Dive Into SparkA Deep Dive Into Spark
A Deep Dive Into SparkAshish kumar
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkMuktadiur Rahman
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkJUGBD
 

Similar to Introduction to Apache Spark (20)

Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Spark core
Spark coreSpark core
Spark core
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Spark 101
Spark 101Spark 101
Spark 101
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
 
A Deep Dive Into Spark
A Deep Dive Into SparkA Deep Dive Into Spark
A Deep Dive Into Spark
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 

More from Rahul Jain

Flipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationFlipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationRahul Jain
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to ScalaRahul Jain
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremRahul Jain
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneRahul Jain
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperRahul Jain
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
Hibernate tutorial for beginners
Hibernate tutorial for beginnersHibernate tutorial for beginners
Hibernate tutorial for beginnersRahul Jain
 

More from Rahul Jain (14)

Flipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationFlipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and Recommendation
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Hibernate tutorial for beginners
Hibernate tutorial for beginnersHibernate tutorial for beginners
Hibernate tutorial for beginners
 

Recently uploaded

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Recently uploaded (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Introduction to Apache Spark

  • 1. Apache Spark In-Memory Data Processing September 2014 Meetup Organized by Big Data Hyderabad Meetup Group. http://www.meetup.com/Big-Data-Hyderabad/ Rahul Jain @rahuldausa
  • 2. Agenda • Why Spark • Introduction • Basics • Hands-on – Installation – Examples 2
  • 3. Quick Questionnaire How many people know/work on Scala ? How many people know/work on Python ? How many people know/heard/are using Spark ?
  • 4. Why Spark ? • Most of Machine Learning Algorithms are iterative because each iteration can improve the results • With Disk based approach each iteration’s output is written to disk making it slow Hadoop execution flow Spark execution flow http://www.wiziq.com/blog/hype-around-apache-spark/
  • 5. About Apache Spark • Initially started at UC Berkeley in 2009 • Fast and general purpose cluster computing system • 10x (on disk) - 100x (In-Memory) faster • Most popular for running Iterative Machine Learning Algorithms. • Provides high level APIs in • Java • Scala • Python • Integration with Hadoop and its eco-system and can read existing data. • http://spark.apache.org/
  • 6. Spark Stack • Spark SQL – For SQL and unstructured data processing • MLib – Machine Learning Algorithms • GraphX – Graph Processing • Spark Streaming – stream processing of live data streams http://spark.apache.org
  • 8. Terminology • Application Jar – User Program and its dependencies except Hadoop & Spark Jars bundled into a Jar file • Driver Program – The process to start the execution (main() function) • Cluster Manager – An external service to manage resources on the cluster (standalone manager, YARN, Apache Mesos) • Deploy Mode – cluster : Driver inside the cluster – client : Driver outside of Cluster
  • 9. Terminology (contd.) • Worker Node : Node that run the application program in cluster • Executor – Process launched on a worker node, that runs the Tasks – Keep data in memory or disk storage • Task : A unit of work that will be sent to executor • Job – Consists multiple tasks – Created based on a Action • Stage : Each Job is divided into smaller set of tasks called Stages that is sequential and depend on each other • SparkContext : – represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
  • 10. Resilient Distributed Dataset (RDD) • Resilient Distributed Dataset (RDD) is a basic Abstraction in Spark • Immutable, Partitioned collection of elements that can be operated in parallel • Basic Operations – map – filter – persist • Multiple Implementation – PairRDDFunctions : RDD of Key-Value Pairs, groupByKey, Join – DoubleRDDFunctions : Operation related to double values – SequenceFileRDDFunctions : Operation related to SequenceFiles • RDD main characteristics: – A list of partitions – A function for computing each split – A list of dependencies on other RDDs – Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned) – Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file) • Custom RDD can be also implemented (by overriding functions)
  • 11. Cluster Deployment • Standalone Deploy Mode – simplest way to deploy Spark on a private cluster • Amazon EC2 – EC2 scripts are available – Very quick launching a new cluster • Apache Mesos • Hadoop YARN
  • 15. Let’s try some examples…
  • 16. Spark Shell ./bin/spark-shell --master local[2] The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. ./bin/run-example SparkPi 10 This will run 10 iterations to calculate the value of Pi
  • 17. Basic operations… scala> val textFile = sc.textFile("README.md") textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 scala> textFile.count() // Number of items in this RDD ees0: Long = 126 scala> textFile.first() // First item in this RDD res1: String = # Apache Spark scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09 Simplier - Single liner: scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"? res3: Long = 15
  • 18. Map - Reduce scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b) res4: Long = 15 scala> import java.lang.Math scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) res5: Int = 15 scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8 wordCounts.collect()
  • 19. With Caching… scala> linesWithSpark.cache() res7: spark.RDD[String] = spark.FilteredRDD@17e51082 scala> linesWithSpark.count() res8: Long = 15 scala> linesWithSpark.count() res9: Long = 15
  • 20. With HDFS… val lines = spark.textFile(“hdfs://...”) val errors = lines.filter(line => line.startsWith(“ERROR”)) println(Total errors: + errors.count())
  • 21. Standalone (Scala) /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system val conf = new SparkConf().setAppName("Simple Application") .setMaster(“local") val sc = new SparkContext(conf) val logData = sc.textFile(logFile, 2).cache() val numAs = logData.filter(line => line.contains("a")).count() val numBs = logData.filter(line => line.contains("b")).count() println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) } }
  • 22. Standalone (Java) /* SimpleApp.java */ import org.apache.spark.api.java.*; import org.apache.spark.SparkConf; import org.apache.spark.api.java.function.Function; public class SimpleApp { public static void main(String[] args) { String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD<String> logData = sc.textFile(logFile).cache(); long numAs = logData.filter(new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("a"); } }).count(); long numBs = logData.filter(new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("b"); } }).count(); System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs); } }
  • 23. Standalone (Python) """SimpleApp.py""" from pyspark import SparkContext logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system sc = SparkContext("local", "Simple App") logData = sc.textFile(logFile).cache() numAs = logData.filter(lambda s: 'a' in s).count() numBs = logData.filter(lambda s: 'b' in s).count() print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
  • 24. Job Submission $SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
  • 25. Configuration val conf = new SparkConf() .setMaster("local") .setAppName("CountingSheep") .set("spark.executor.memory", "1g") val sc = new SparkContext(conf)
  • 27. Thanks! @rahuldausa on twitter and slideshare http://www.linkedin.com/in/rahuldausa 27 Join us @ For Solr, Lucene, Elasticsearch, Machine Learning, IR http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/ http://www.meetup.com/DataAnalyticsGroup/ Join us @ For Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting edge technologies. http://www.meetup.com/Big-Data-Hyderabad/