In-Memory Analytics with Apache Spark
Ravi
Agenda
• Overview of Spark
• Spark with Hadoop MapReduce
• Spark Elements and Operations
• Spark Cluster Overview
• Spark Examples
• Spark Stack Extensions:
  • Shark
  • Streaming
  • MLlib
  • GraphX
In Memory Analytics
• In-memory analytics is an approach to querying data when it resides in a
computer’s random access memory (RAM), as opposed to querying data
that is stored on physical disks.
• This results in vastly shortened query response times, allowing business
intelligence (BI) and analytic applications to support faster business
decisions.
• As the cost of RAM declines, in-memory analytics is becoming feasible
for many businesses.
• BI and analytic applications have long supported caching data in RAM, but
older 32-bit operating systems provided only 4 GB of addressable memory.
• Newer 64-bit operating systems can address far more memory (1 terabyte (TB)
or more, depending on the platform), which makes it possible to cache
large volumes of data, potentially an entire data warehouse or data mart,
in a computer’s RAM.
What is Spark
- Lightning-Fast Cluster Computing
• Not a modified version of Hadoop
• Separate, fast, MapReduce-like engine
  • In-memory data storage for very fast iterative queries
  • General execution graphs and powerful optimizations
  • Up to 40x faster than Hadoop
• Spark beats Hadoop by providing primitives for in-memory cluster
computing, thereby avoiding the I/O bottleneck between the individual
jobs of an iterative MapReduce workflow that repeatedly performs
computations on the same working set.
• Compatible with Hadoop’s storage APIs
  • Can read from and write to any Hadoop-supported system, including
HDFS, HBase, SequenceFiles, etc.
Quick Recap Hadoop Eco System
Spark Programming Model
• Key idea: Resilient Distributed Datasets (RDDs)
  • Distributed collections of objects that can be cached in memory across cluster nodes
  • Manipulated through various parallel operations
  • Automatically rebuilt on failure
• Types of RDD:
  • Parallelized collections: take an existing Scala collection and run functions on it in
parallel
      scala> val data = Array(1, 2, 3, 4, 5)
      scala> val distData = sc.parallelize(data)
      distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
  • Hadoop datasets: run functions on each record of a file in the Hadoop Distributed File
System (HDFS) or any other storage system supported by Hadoop
      scala> val distFile = sc.textFile("data.txt")
      distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
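The two ways of creating an RDD listed above can be sketched as a small standalone program. This is a minimal sketch, assuming Spark 2.x or later on the classpath and a local master. The object name RddCreation, the application name, and the temporary input file are illustrative and do not come from the slides.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  // Parallelized collection: distribute a local Seq and sum the squares.
  def sumOfSquares(sc: SparkContext, data: Seq[Int]): Int =
    sc.parallelize(data).map(x => x * x).reduce(_ + _)

  // Hadoop dataset: read a text file line by line and count the lines.
  def lineCount(sc: SparkContext, path: String): Long =
    sc.textFile(path).count()

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM; a real deployment would use a cluster URL.
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(sumOfSquares(sc, Seq(1, 2, 3, 4, 5)))

      // Hypothetical input, written to a temp file so the sketch is self-contained.
      val file = Files.createTempFile("data", ".txt")
      Files.write(file, "first line\nsecond line\n".getBytes("UTF-8"))
      println(lineCount(sc, file.toString))
    } finally {
      sc.stop()
    }
  }
}
```

Both helpers take the SparkContext as a parameter, so the same code can be pointed at a cluster by changing only the master setting in main.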
Automatic Parallelization of Complex Flows
• When constructing a complex pipeline of MapReduce jobs, the task of correctly
parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as
Apache Oozie is often required to carefully construct this sequence.
• With Spark, a whole series of individual tasks is expressed as a single program
flow that is lazily evaluated so that the system has a complete picture of the
execution graph.
• This approach allows the core scheduler to correctly map the dependencies
across different stages in the application, and automatically parallelize
the flow of operators without user intervention.
For example, consider the following job (pseudocode):
rdd1.map(splitlines).filter("ERROR")
rdd2.map(splitlines).groupBy(key)
rdd2.join(rdd1, key).take(10)
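Lazy evaluation can be observed directly. The sketch below assumes Spark 2.x or later in local mode, and its object name and sample log lines are made up for illustration. It counts how many records the map function has touched before and after an action is called. Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this is a demonstration device rather than a reliable counter.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluation {
  // Returns (records processed before the action, records processed after it).
  def run(sc: SparkContext): (Long, Long) = {
    val processed = sc.longAccumulator("processed")

    // Transformations only describe the computation; no job is submitted here.
    val errors = sc.parallelize(Seq("INFO ok", "ERROR disk", "ERROR net", "WARN slow"))
      .map { line => processed.add(1L); line }
      .filter(_.startsWith("ERROR"))

    val before: Long = processed.value

    // The lineage that the scheduler will turn into stages and tasks.
    println(errors.toDebugString)

    // The action triggers execution of the whole pipeline.
    errors.count()
    val after: Long = processed.value

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[*]"))
    try println(run(sc)) finally sc.stop()
  }
}
```

Because the scheduler sees the full chain before running anything, it can pipeline the map and filter steps into a single pass over each partition.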
Spark vs Hadoop
Spark is a high-speed cluster computing system, compatible with Hadoop, that
can outperform it by up to 100 times thanks to its ability to perform
computations in memory.
Transformations (e.g. map, filter, groupBy):
Create a new dataset from an existing one
Actions (e.g. count, collect, save):
Return a value to the driver program after running a computation
on the dataset
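The split between transformations and actions can be illustrated with a word count. This is a minimal sketch, assuming Spark 2.x or later in local mode. The object name WordCount and the sample input are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def count(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    // Transformations: each call returns a new RDD and is evaluated lazily.
    val counts = sc.parallelize(lines)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Action: collect() runs the job and returns the result to the driver.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("word-count-sketch").setMaster("local[*]"))
    try println(count(sc, Seq("a b a", "b c"))) finally sc.stop()
  }
}
```

collect() brings the whole result into driver memory, so it suits small outputs like this one. Larger results are usually written out with a save action instead.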
Spark Elements
• Application: User program built on Spark. Consists of a driver program and executors on the
cluster.
• Driver program: The process running the main() function of the application and creating the
SparkContext.
• Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone
manager, Mesos, YARN).
• Worker node: Any node that can run application code in the cluster.
• Executor: A process launched for an application on a worker node, that runs tasks and
keeps data in memory or disk storage across them. Each application has its own executors.
• Task: A unit of work that will be sent to one executor.
• Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark
action (e.g. save, collect); you'll see this term used in the driver's logs.
• Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other
(similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
Spark Cluster Overview
Cluster Manager Types
• Standalone – a simple cluster manager included with Spark that makes it
easy to set up a cluster.
• Apache Mesos – a general cluster manager that can also run Hadoop
MapReduce and service applications.
• Hadoop YARN – the resource manager in Hadoop 2.
Mesos (Dynamic Resource Sharing for Clusters) Run Modes
• Spark can run over Mesos in two modes: “fine-grained” and “coarse-grained”.
• In fine-grained mode, which is the default, each Spark task runs as a
separate Mesos task.
  • This allows multiple instances of Spark (and other frameworks) to share machines at
a very fine granularity, where each application gets more or fewer machines as it
ramps up and down, but it comes with an additional overhead in launching each task, which
may be inappropriate for low-latency applications (e.g. interactive queries or serving
web requests).
• Coarse-grained mode instead launches only one long-running Spark
task on each Mesos machine, and dynamically schedules its own “mini-tasks” within it.
  • The benefit is much lower startup overhead, but at the cost of reserving the Mesos
resources for the complete duration of the application.
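Choosing between the two Mesos modes is a configuration matter. The sketch below builds a SparkConf with the spark.mesos.coarse property, the switch that Spark's Mesos documentation describes for this purpose. The master address is a hypothetical placeholder and the object name is illustrative. Defaults and Mesos support have differed between Spark versions, so the documentation for the version in use should be checked.

```scala
import org.apache.spark.SparkConf

object MesosModes {
  // Builds a configuration for running on Mesos in the requested mode.
  // `mesosMaster` is expected in the form "mesos://host:5050".
  def conf(mesosMaster: String, coarseGrained: Boolean): SparkConf =
    new SparkConf()
      .setAppName("mesos-mode-sketch")
      .setMaster(mesosMaster)
      .set("spark.mesos.coarse", coarseGrained.toString)

  def main(args: Array[String]): Unit = {
    // Hypothetical Mesos master address; replace with a real one.
    val c = conf("mesos://mesos-master.example.com:5050", coarseGrained = true)
    // Prints the settings only; no cluster connection is made here.
    println(c.toDebugString)
  }
}
```

The other cluster managers are selected the same way through the master URL, for example spark://host:7077 for the standalone manager or local[*] for a single JVM.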
Task Scheduler
• Runs general DAGs
• Pipelines functions within a stage
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
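Partitioning awareness can be sketched with two pair RDDs that share a partitioner. The example assumes Spark 2.x or later in local mode, and the object name and sample data are made up for illustration. When both inputs are already hash-partitioned the same way, the join can reuse that partitioning instead of shuffling either side again.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionAware {
  // Returns (whether the join result kept the shared partitioner, joined row count).
  def run(sc: SparkContext): (Boolean, Long) = {
    val partitioner = new HashPartitioner(4)

    // Partition both pair RDDs identically once, and cache the partitioned data.
    val users = sc.parallelize(Seq((1, "ann"), (2, "bob"), (3, "cy")))
      .partitionBy(partitioner).cache()
    val visits = sc.parallelize(Seq((1, "/home"), (1, "/cart"), (3, "/home")))
      .partitionBy(partitioner).cache()

    // Keys with the same hash already live in the same partition on both sides.
    val joined = users.join(visits)

    (joined.partitioner == Some(partitioner), joined.count())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-aware-sketch").setMaster("local[*]"))
    try println(run(sc)) finally sc.stop()
  }
}
```

The partitionBy calls still shuffle once each. The saving comes when the partitioned, cached RDDs are reused across several joins or aggregations on the same key.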
Spark Stack Extension
Spark powers a stack of high-level tools including:
• Shark for SQL
• MLlib for machine learning
• GraphX for graph processing
• Spark Streaming
You can combine these frameworks seamlessly in the same
application.
Shark
Shark makes Hive faster and more powerful.
• Shark is a new data analysis system that marries query
processing with complex analytics on large clusters.
• Shark is an open source distributed SQL query engine for
Hadoop data. It brings state-of-the-art performance and
advanced analytics to Hive users.
• Speed: run Hive queries up to 100x faster in memory, or
10x faster on disk.
Streaming
Spark Streaming makes it easy to build scalable, fault-tolerant
streaming applications.
• Spark Streaming brings Spark's language-integrated API to stream processing, letting
you write streaming applications the same way you write batch jobs.
• It supports both Java and Scala.
• Spark Streaming lets you reuse the same code for batch processing, join streams
against historical data, or run ad-hoc queries on stream state.
• Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ.
• Since Spark Streaming is built on top of Spark, users can apply Spark's built-in
machine learning algorithms (MLlib) and graph processing algorithms (GraphX) to
data streams.
Counting tweets on a sliding window:
TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
Finding words with higher frequency than in historic data:
stream.join(historicCounts).filter {
  case (word, (curCount, oldCount)) =>
    curCount > oldCount
}
MLlib
MLlib is Apache Spark's scalable machine learning library.
• MLlib fits into Spark's APIs and interoperates with NumPy in
Python (starting in Spark 0.9). You can use any Hadoop data
source (e.g. HDFS, HBase, or local files), making it easy to plug
into Hadoop workflows.
Calling MLlib in Scala (schematic):
val points = spark.textFile("hdfs://...")
  .map(parsePoint)
val model = KMeans.train(points)
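In Scala, a complete k-means call could be sketched as follows. This assumes Spark 2.x or later with the spark-mllib module on the classpath and uses the RDD-based API in org.apache.spark.mllib. The parsePoint helper, the space-separated input format, and the six sample points are assumptions made for the example. The number of clusters and the iteration limit are passed explicitly.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object KMeansSketch {
  // Parses a line such as "1.0 2.0" into a dense vector (assumed input format).
  def parsePoint(line: String): Vector =
    Vectors.dense(line.trim.split("\\s+").map(_.toDouble))

  // Trains k-means on two well-separated groups and reports whether
  // a point from each group is assigned to a different cluster.
  def separates(sc: SparkContext): Boolean = {
    val lines = Seq("0.0 0.0", "0.1 0.1", "0.2 0.0", "9.0 9.0", "9.1 9.2", "8.9 9.0")
    val points = sc.parallelize(lines).map(parsePoint).cache()

    // k = 2 clusters, at most 20 iterations.
    val model = KMeans.train(points, 2, 20)

    model.predict(Vectors.dense(0.0, 0.0)) != model.predict(Vectors.dense(9.0, 9.0))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]"))
    try println(separates(sc)) finally sc.stop()
  }
}
```

The input RDD is cached because k-means is iterative and reads the same points on every pass, which is the workload in-memory storage is meant to speed up.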
GraphX
Unifying Graphs and Tables
• GraphX extends the distributed fault-tolerant collections API and
interactive console of Spark with a new graph API which leverages
recent advances in graph systems (e.g., GraphLab) to enable
users to easily and interactively build, transform, and reason about
graph-structured data at scale.
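The idea of moving between graph and table views can be sketched with a tiny "follows" graph. The example assumes Spark 2.x or later with the spark-graphx module on the classpath. The object name, user names, and edges are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXSketch {
  // Builds a small graph and returns the follower count of each user who has followers.
  def followerCounts(sc: SparkContext): Map[String, Int] = {
    // Vertex table: (id, name). GraphX vertex ids are Longs.
    val users = sc.parallelize(Seq((1L, "ann"), (2L, "bob"), (3L, "cy")))
    // Edge table: Edge(src, dst, attribute), read as "src follows dst".
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(3L, 2L, "follows"),
      Edge(2L, 1L, "follows")))

    val graph = Graph(users, follows)

    // Graph view back to table view: join in-degrees onto the vertex names.
    // inDegrees omits vertices that have no incoming edges.
    graph.vertices.join(graph.inDegrees)
      .map { case (_, (name, inDegree)) => (name, inDegree) }
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))
    try println(followerCounts(sc)) finally sc.stop()
  }
}
```

The vertices and in-degrees are ordinary RDDs of pairs, so the usual RDD operations such as join and map apply to them directly.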
BDAS, the Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that
integrates software components being built by the AMPLab to make sense of Big Data.
Software and Research Projects
• Shark - Hive and SQL on top of Spark
• MLbase - Machine learning project on top of Spark
• BlinkDB - A massively parallel, approximate query engine built on top of Shark and Spark
• GraphX - A graph processing and analytics framework on top of Spark (GraphX has been merged into
Spark 0.9)
• Apache Mesos - Cluster management system that supports running Spark
• Tachyon - In-memory storage system that supports running Spark
• Apache MRQL - A query processing and optimization system for large-scale, distributed data
analysis, built on top of Apache Hadoop, Hama, and Spark
• OpenDL - A deep learning algorithm library based on the Spark framework (just getting started)
• SparkR - R frontend for Spark
• Spark Job Server - REST interface for managing and submitting Spark jobs on the same cluster
Conclusion
• Big data is moving beyond one-pass batch jobs to
low-latency apps that need data sharing
• RDDs offer fault-tolerant sharing at memory speed
• Spark uses them to combine streaming, batch and
interactive analytics in one system

 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 

In Memory Analytics with Apache Spark

  • 2. Agenda
      • Overview of Spark
      • Spark with Hadoop MapReduce
      • Spark Elements and Operations
      • Spark Cluster Overview
      • Spark Examples
      • Spark Stack Extensions: Shark, Streaming, MLlib, GraphX
  • 3. In Memory Analytics
      • In-memory analytics is an approach to querying data while it resides in a computer’s random access memory (RAM), as opposed to querying data stored on physical disks.
      • This greatly shortens query response times, allowing business intelligence (BI) and analytic applications to support faster business decisions.
      • As the cost of RAM declines, in-memory analytics is becoming feasible for many businesses.
      • BI and analytic applications have long supported caching data in RAM, but older 32-bit operating systems provided only 4 GB of addressable memory.
      • Newer 64-bit operating systems, with up to 1 terabyte (TB) of addressable memory (and perhaps more in the future), have made it possible to cache large volumes of data, potentially an entire data warehouse or data mart, in a computer’s RAM.
  • 4. What is Spark: Lightning-Fast Cluster Computing
      • Not a modified version of Hadoop
      • Separate, fast, MapReduce-like engine
          • In-memory data storage for very fast iterative queries
          • General execution graphs and powerful optimizations
          • Up to 40x faster than Hadoop
      • Spark beats Hadoop by providing primitives for in-memory cluster computing, thereby avoiding the I/O bottleneck between the individual jobs of an iterative MapReduce workflow that repeatedly performs computations on the same working set.
      • Compatible with Hadoop’s storage APIs
          • Can read from and write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
  • 5. Quick Recap: Hadoop Ecosystem
  • 6.
  • 7.
  • 8. Spark Programming Model
      • Key idea: Resilient Distributed Datasets (RDDs)
          • Distributed collections of objects that can be cached in memory across cluster nodes
          • Manipulated through various parallel operations
          • Automatically rebuilt on failures
      • Types of RDD:
          • Parallelized collections: take an existing Scala collection and run functions on it in parallel
                scala> val distData = sc.parallelize(data)
                distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
          • Hadoop datasets: run functions on each record of a file in the Hadoop Distributed File System (HDFS) or any other storage system supported by Hadoop
                scala> val distFile = sc.textFile("data.txt")
                distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
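A minimal sketch of the two RDD types on this slide, assuming a current Spark release on the classpath and local mode. The object name `RddCreationSketch`, the helper names, and the temp-file contents are illustrative and not from the slides. The `spark.RDD` type in the REPL output above comes from an early release; in current releases the class is `org.apache.spark.rdd.RDD`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddCreationSketch {
  // Parallelized collection: distribute a local Scala collection.
  def fromCollection(sc: SparkContext): RDD[Int] =
    sc.parallelize(Seq(1, 2, 3, 4, 5))

  // Hadoop dataset: one RDD element per line of a text file.
  def fromTextFile(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM, so no cluster is needed for the sketch.
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val distData = fromCollection(sc)
      println(distData.map(_ * 2).reduce(_ + _)) // sum of the doubled values

      // Illustrative input file, written to a temp location.
      val tmp = Files.createTempFile("data", ".txt")
      Files.write(tmp, "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8))
      val distFile = fromTextFile(sc, tmp.toUri.toString)
      println(distFile.map(_.length).reduce(_ + _)) // total characters across lines
    } finally sc.stop()
  }
}
```

Keeping the RDD construction in small functions that take a `SparkContext` makes the same code usable from `spark-shell`, where `sc` is already defined.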
  • 9.
  • 10.
  • 11. Automatic Parallelization of Complex Flows
      • When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as Apache Oozie is often required to carefully construct this sequence.
      • With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so that the system has a complete picture of the execution graph.
      • This approach allows the core scheduler to correctly map the dependencies across different stages in the application, and to automatically parallelize the flow of operators without user intervention.
      • For example, consider the following job:
            rdd1.map(splitlines).filter("ERROR")
            rdd2.map(splitlines).groupBy(key)
            rdd2.join(rdd1, key).take(10)
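The example job on this slide is pseudocode. One way to make it concrete against the RDD API is sketched here, assuming Spark on the classpath and local mode; the record format, the sample data, and the names `LazyJobSketch`, `splitLine` and `buildFlow` are invented for illustration. The whole flow is only described in `buildFlow`; Spark computes nothing until `take` is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LazyJobSketch {
  // Hypothetical record format: "<key> <rest of line>".
  private def splitLine(line: String): (String, String) = {
    val parts = line.split(" ", 2)
    (parts(0), parts(1))
  }

  // Describes the flow as one execution graph; no data is processed here.
  def buildFlow(sc: SparkContext): RDD[(String, (Iterable[String], String))] = {
    val rdd1 = sc
      .parallelize(Seq("h1 ERROR disk full", "h2 INFO started", "h1 ERROR timeout"))
      .map(splitLine)
      .filter { case (_, rest) => rest.contains("ERROR") }
    val rdd2 = sc
      .parallelize(Seq("h1 rack-a", "h2 rack-b"))
      .map(splitLine)
      .groupByKey()
    rdd2.join(rdd1) // joins on the key of each pair
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-job-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val flow = buildFlow(sc)
      println(flow.toDebugString)    // the lineage Spark has recorded so far
      flow.take(10).foreach(println) // action: the scheduler now runs the graph
    } finally sc.stop()
  }
}
```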
  • 12. Spark vs Hadoop
      • Spark is a high-speed cluster computing system, compatible with Hadoop, that can outperform it by up to 100 times thanks to its ability to perform computations in memory.
  • 13. Transformations and Actions
      • Transformations (e.g. map, filter, groupBy): create a new dataset from an existing one.
      • Actions (e.g. count, collect, save): return a value to the driver program after running a computation on the dataset.
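The difference between the two kinds of operation can be observed directly. The sketch below, assuming Spark on the classpath and local mode, counts how often the function passed to `map` runs by using a `LongAccumulator`; the names `LazyEvaluationSketch` and `mapCalls` are illustrative. The counts assume no task is retried, since accumulator updates made inside a transformation can be applied again when a task is re-executed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls before any action, map calls after count(), result of count()).
  def run(sc: SparkContext): (Long, Long, Long) = {
    val mapCalls = sc.longAccumulator("mapCalls")
    // Transformation: only recorded in the lineage, not executed yet.
    val doubled = sc.parallelize(1 to 5).map { x => mapCalls.add(1L); x * 2 }
    val before = mapCalls.sum
    // Action: triggers the job and returns a value to the driver.
    val n = doubled.count()
    val after = mapCalls.sum
    (before, after, n)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```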
  • 14. Spark Elements
      • Application: user program built on Spark. Consists of a driver program and executors on the cluster.
      • Driver program: the process running the main() function of the application and creating the SparkContext.
      • Cluster manager: an external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
      • Worker node: any node that can run application code in the cluster.
      • Executor: a process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
      • Task: a unit of work that will be sent to one executor.
      • Job: a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
      • Stage: each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
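A word count is a small way to see these terms together. In the sketch below (Spark on the classpath and local mode assumed; `JobStagesSketch` and the sample sentences are illustrative), `main` plays the role of the driver program, `collect` is the action that spawns one job, and `reduceByKey` introduces a shuffle that splits that job into two stages, each running one task per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobStagesSketch {
  def wordCounts(sc: SparkContext): Map[String, Int] =
    sc.parallelize(Seq("spark makes jobs", "jobs have stages", "stages have tasks"))
      .flatMap(_.split(" "))  // narrow transformation: stays in the first stage
      .map(word => (word, 1)) // narrow transformation: same stage
      .reduceByKey(_ + _)     // shuffle boundary: the second stage starts here
      .collect()              // action: submits a single job
      .toMap

  def main(args: Array[String]): Unit = {
    // Driver program: creates the SparkContext and submits the job.
    val conf = new SparkConf().setAppName("job-stages-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try wordCounts(sc).foreach(println) finally sc.stop()
  }
}
```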
  • 15. Spark Cluster Overview: Cluster Manager Types
      • Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.
      • Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and service applications.
      • Hadoop YARN: the resource manager in Hadoop 2.
  • 16. Mesos (Dynamic Resource Sharing for Clusters): Run Modes
      • Spark can run over Mesos in two modes: “fine-grained” and “coarse-grained”.
      • In fine-grained mode, which is the default, each Spark task runs as a separate Mesos task.
      • This allows multiple instances of Spark (and other frameworks) to share machines at a very fine granularity, where each application gets more or fewer machines as it ramps up and down. It comes with additional overhead in launching each task, which may be inappropriate for low-latency applications (e.g. interactive queries or serving web requests).
      • Coarse-grained mode instead launches only one long-running Spark task on each Mesos machine, and dynamically schedules its own “mini-tasks” within it.
      • The benefit is much lower startup overhead, at the cost of reserving the Mesos resources for the complete duration of the application.
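A hedged configuration sketch for switching between the two modes, assuming Spark on the classpath. The run mode is controlled by the `spark.mesos.coarse` property of Spark's Mesos support; the master address and the names `MesosModeSketch` and `MesosMaster` are placeholders. The default value of this property has changed across Spark releases, so it is worth checking the documentation for the version in use.

```scala
import org.apache.spark.SparkConf

object MesosModeSketch {
  // Placeholder Mesos master address, not a real endpoint.
  val MesosMaster = "mesos://host:5050"

  // Builds a configuration that requests coarse-grained (true) or fine-grained (false) mode.
  def conf(coarseGrained: Boolean): SparkConf =
    new SparkConf()
      .setAppName("mesos-mode-sketch")
      .setMaster(MesosMaster)
      .set("spark.mesos.coarse", coarseGrained.toString)
}
```

Building the `SparkConf` does not contact the cluster; the setting only takes effect when a `SparkContext` is created from it.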
  • 17.
  • 18. Task Scheduler
      • Runs general DAGs
      • Pipelines functions within a stage
      • Cache-aware data reuse and locality
      • Partitioning-aware to avoid shuffles
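The last two points can be illustrated with a join over co-partitioned data. In this sketch (Spark on the classpath and local mode assumed; `CoPartitionSketch` and the sample data are illustrative), both inputs are hash-partitioned the same way and one is cached, so the join can reuse the existing partitioning instead of shuffling both sides again.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoPartitionSketch {
  def joinCoPartitioned(sc: SparkContext): RDD[(Int, (String, Double))] = {
    val partitioner = new HashPartitioner(4)
    // Partition once and cache, so later jobs can reuse the partitioned data.
    val users = sc
      .parallelize(Seq(1 -> "ann", 2 -> "bob", 3 -> "cy"))
      .partitionBy(partitioner)
      .cache()
    val scores = sc
      .parallelize(Seq(1 -> 0.5, 2 -> 0.9))
      .partitionBy(partitioner)
    // Both sides already use the same partitioner, so the join needs no further shuffle.
    users.join(scores, partitioner)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("co-partition-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try joinCoPartitioned(sc).collect().foreach(println) finally sc.stop()
  }
}
```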
  • 19. Spark Stack Extension
      • Spark powers a stack of high-level tools, including:
          • Shark for SQL
          • MLlib for machine learning
          • GraphX
          • Spark Streaming
      • You can combine these frameworks seamlessly in the same application.
  • 20.
  • 21. Shark
      • Shark makes Hive faster and more powerful.
      • Shark is a new data analysis system that marries query processing with complex analytics on large clusters.
      • Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users.
      • Speed: run Hive queries up to 100x faster in memory, or 10x faster on disk.
  • 22.
  • 23. Streaming
      • Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
      • Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs.
      • It supports both Java and Scala.
      • Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
      • Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ.
      • Since Spark Streaming is built on top of Spark, users can apply Spark's built-in machine learning algorithms (MLlib) and graph processing algorithms (GraphX) to data streams.
      • Counting tweets on a sliding window:
            TwitterUtils.createStream(...)
              .filter(_.getText.contains("Spark"))
              .countByWindow(Seconds(5))
      • Finding words with higher frequency than in historic data:
            stream.join(historicCounts).filter {
              case (word, (curCount, oldCount)) => curCount > oldCount
            }
  • 24. MLlib
      • MLlib is Apache Spark's scalable machine learning library.
      • MLlib fits into Spark's APIs and interoperates with NumPy in Python (starting in Spark 0.9). You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
      • Calling MLlib in Python:
            points = spark.textFile("hdfs://...").map(parsePoint)
            model = KMeans.train(points)
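For comparison, a Scala sketch of the same idea using the RDD-based `org.apache.spark.mllib` API, assuming Spark with MLlib on the classpath and local mode. The input format, the sample points, the cluster count and the names `KMeansSketch`, `parsePoint` and `train` are assumptions made for illustration; they are not taken from the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object KMeansSketch {
  // Hypothetical input format: one point per line, coordinates separated by spaces.
  def parsePoint(line: String): Vector =
    Vectors.dense(line.trim.split(" ").map(_.toDouble))

  def train(sc: SparkContext, lines: Seq[String], k: Int): KMeansModel = {
    // Cache the points: k-means passes over the same data on every iteration.
    val points = sc.parallelize(lines).map(parsePoint).cache()
    KMeans.train(points, k, 20) // k clusters, at most 20 iterations
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val model = train(sc, Seq("0.0 0.0", "0.1 0.1", "9.0 9.0", "9.1 9.1"), k = 2)
      model.clusterCenters.foreach(println)
    } finally sc.stop()
  }
}
```

Caching the input RDD is the in-memory reuse the earlier slides describe: an iterative algorithm reads the same working set many times.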
  • 25. GraphX: Unifying Graphs and Tables
      • GraphX extends the distributed fault-tolerant collections API and interactive console of Spark with a new graph API, which leverages recent advances in graph systems (e.g. GraphLab) to enable users to easily and interactively build, transform, and reason about graph-structured data at scale.
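A small sketch of what "graphs and tables" can look like in code, assuming Spark with GraphX on the classpath and local mode. The names `GraphSketch` and `followerCounts` and the sample users and edges are illustrative. A graph is built from two ordinary RDDs, a graph operation (`inDegrees`) is applied, and the result is joined back to the vertex table like any other pair RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphSketch {
  def followerCounts(sc: SparkContext): Map[String, Int] = {
    // Table view: vertices and edges are plain RDDs.
    val users = sc.parallelize(Seq((1L, "ann"), (2L, "bob"), (3L, "cy")))
    val follows = sc.parallelize(Seq(
      Edge(2L, 1L, "follows"),
      Edge(3L, 1L, "follows"),
      Edge(1L, 2L, "follows")))
    // Graph view: the same data as a property graph.
    val graph = Graph(users, follows)
    // Back to tables: join in-degrees with user names.
    // Vertices with no incoming edge do not appear in inDegrees.
    users
      .join(graph.inDegrees)
      .map { case (_, (name, followers)) => (name, followers) }
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graph-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try followerCounts(sc).foreach(println) finally sc.stop()
  }
}
```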
  • 26. BDAS, the Berkeley Data Analytics Stack (https://amplab.cs.berkeley.edu/software/)
      • BDAS is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.
  • 27. Software and Research Projects
      • Shark: Hive and SQL on top of Spark
      • MLbase: machine learning project on top of Spark
      • BlinkDB: a massively parallel, approximate query engine built on top of Shark and Spark
      • GraphX: a graph processing and analytics framework on top of Spark (GraphX has been merged into Spark 0.9)
      • Apache Mesos: cluster management system that supports running Spark
      • Tachyon: in-memory storage system that supports running Spark
      • Apache MRQL: a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
      • OpenDL: a deep learning algorithm library based on the Spark framework (just getting started)
      • SparkR: R frontend for Spark
      • Spark Job Server: REST interface for managing and submitting Spark jobs on the same cluster
  • 28. Conclusion
      • “Big data” is moving beyond one-pass batch jobs to low-latency apps that need data sharing.
      • RDDs offer fault-tolerant sharing at memory speed.
      • Spark uses them to combine streaming, batch and interactive analytics in one system.

Editor's Notes

  1. http://blog.cloudera.com/blog/2014/03/apache-spark-a-delight-for-developers/