Learning Spark ch01 - Introduction to Data Analysis with Spark
References to Spark Courses
Course : Introduction to Big Data with Apache Spark : http://ouo.io/Mqc8L5
Course : Spark Fundamentals I : http://ouo.io/eiuoV
Course : Functional Programming Principles in Scala : http://ouo.io/rh4vv
1. Chapter 01: Introduction to Data Analysis with Spark
Learning Spark
by Holden Karau et al.
2. Overview: Introduction to Data Analysis with Spark
What Is Apache Spark?
A Unified Stack
Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX
Cluster Managers
Who Uses Spark, and for What?
Data Science Tasks
Data Processing Applications
A Brief History of Spark
Spark Versions and Releases
Storage Layers for Spark
3. 1.1 What Is Apache Spark?
Apache Spark is a cluster computing platform designed to be fast and general purpose
Spark extends the MapReduce model to support different computations:
batch applications,
iterative algorithms,
interactive queries,
and streaming
Runs computations in memory
Highly accessible
simple APIs in Python, Java, Scala, and SQL (a minimal sketch follows after this slide)
rich built-in libraries; integrates with Hadoop clusters and data sources
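As a rough illustration of these simple APIs, here is a minimal PySpark sketch of an in-memory batch computation (a word count). It is only a sketch: it assumes a local Spark installation, and "logs.txt" is a hypothetical input file.

# Minimal PySpark word count; "logs.txt" is a placeholder input path.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("WordCount")
sc = SparkContext(conf=conf)

lines = sc.textFile("logs.txt")                    # RDD of lines from a text file
words = lines.flatMap(lambda line: line.split())   # split each line into words
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

print(counts.take(10))                             # bring a few results back to the driver
sc.stop()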
4. edX and Coursera Courses
Introduction to Big Data with Apache Spark
Spark Fundamentals I
Functional Programming Principles in Scala
6. 1.2.1 A Unified Stack: Core, SQL, Streaming
Spark Core
Task Scheduling
Memory management
Fault recovery
Storage system interaction
API that defines resilient distributed datasets (RDDs)
Spark SQL
Provides a SQL interface to Spark
Allows programmatic data manipulation mixed with SQL (see the sketch after this slide)
Spark Streaming
Enables processing of live streams of data, e.g. web logs
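The sketch below illustrates the "programmatic manipulation mixed with SQL" point above. It assumes the SparkSession entry point from newer Spark releases (the edition of the book these slides follow uses SQLContext/HiveContext), and "people.json" with fields "name" and "age" is a hypothetical input file.

# Mixing a SQL query with DataFrame operations in Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlExample").getOrCreate()

people = spark.read.json("people.json")        # load JSON into a DataFrame
people.createOrReplaceTempView("people")       # register it as a SQL table

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")   # SQL query
adults.filter(adults["age"] < 65).show()       # continue programmatically

spark.stop()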
7. 1.2.2 A Unified Stack: MLlib, GraphX, Cluster Managers
MLlib
Contains common machine learning (ML) functionality (a toy example follows after this slide)
Classification, Regression, Clustering, Collaborative Filtering
Model evaluation, Data Import, Lower-level ML primitives
GraphX
Extends the Spark RDD API, just like Spark SQL and Spark Streaming
Contains graph algorithms
Cluster Managers
Hadoop YARN, Apache Mesos
Default: Standalone scheduler
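The toy example below shows the kind of ML functionality MLlib provides, using the DataFrame-based spark.ml API from newer Spark releases (the RDD-based pyspark.mllib API is similar in spirit). The inline dataset and column names are illustrative only.

# Training and applying a logistic regression classifier with spark.ml.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

# (label, feature vector) pairs -- toy data
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1, 0.1])),
    (1.0, Vectors.dense([2.0, 1.0, -1.0])),
    (1.0, Vectors.dense([2.0, 1.3, 1.0])),
], ["label", "features"])

model = LogisticRegression(maxIter=10).fit(train)     # fit the classifier
model.transform(train).select("label", "prediction").show()

spark.stop()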
8. 1.3 Who Uses Spark, and for What?
General-purpose framework for cluster computing
Data Scientists
Engineers
Data Scientists
Analyze and model data
SQL, statistics, predictive modeling (ML) using Python or R
Use cases: interactive shells with Python/Scala and Spark SQL, using MLlib and calling out to Matlab/R (a shell sketch follows after this slide)
Engineers
Data Processing Applications
Principles of software engineering (encapsulation, OOP, interface design)
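A hedged sketch of the interactive-shell use case: the lines below would be typed at the PySpark shell prompt (started with bin/pyspark from a Spark distribution), which pre-creates the SparkContext as sc; "events.txt" is a hypothetical dataset.

# Interactive exploration in the PySpark shell (sc is provided by the shell).
lines = sc.textFile("events.txt")
errors = lines.filter(lambda line: "ERROR" in line)   # keep only error lines
print(errors.count())                                 # trigger the computation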
9. 1.4 A Brief History of Spark
2009: UC Berkeley RAD Lab, which later became the AMPLab
Started because Hadoop MapReduce was inefficient for iterative and interactive
computing jobs; Spark was designed for interactive and iterative query
performance
In-memory storage
Efficient fault recovery; 10-20x faster than MapReduce
Early Adopters
Spark PoweredBy page
Spark Meetups
Spark Summit
2011
Berkeley Data Analytics Stack (BDAS)
10. 1.5 Spark Versions and Releases
May 2014: Spark 1.0.0
April 2015: Spark 1.3.1
Spark Documentation
11. 1.6 Storage Layers for Spark
Spark can create distributed datasets from the following (see the sketch after this list)
HDFS
Any other storage system supported by the Hadoop API, including
Local Filesystem
Amazon S3
Cassandra
Hive
HBase, etc.
Supported file formats
Text file
Sequence file
Avro
Parquet
Hadoop InputFormat
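The sketch below shows how the storage layers and file formats listed above are reached through the same APIs. All paths are placeholders, and HDFS/S3 access additionally requires the appropriate Hadoop configuration, connector JARs, and credentials.

# Creating datasets from different storage layers and file formats (placeholder paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageExample").getOrCreate()
sc = spark.sparkContext

local_rdd = sc.textFile("file:///tmp/data.txt")         # local filesystem
hdfs_rdd = sc.textFile("hdfs://namenode:8020/logs/")    # HDFS
s3_rdd = sc.textFile("s3a://my-bucket/events/")         # Amazon S3 via the s3a connector

parquet_df = spark.read.parquet("hdfs://namenode:8020/tables/events.parquet")  # Parquet
parquet_df.show(5)

spark.stop()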
Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster.
First, all libraries and higher-level components in the stack benefit from improvements at the lower layers.
Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one.
Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models.
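As a rough illustration of combining processing models in one application, the sketch below runs a Spark SQL query and feeds its result straight into RDD-style core operations, with no export/import step in between. File, table, and column names are placeholders.

# Combining Spark SQL with core RDD operations in a single application.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CombinedExample").getOrCreate()

df = spark.read.json("purchases.json")         # hypothetical input file
df.createOrReplaceTempView("purchases")

top = spark.sql("SELECT user, amount FROM purchases WHERE amount > 100")

# Continue with the RDD API on the same data.
totals = top.rdd.map(lambda row: (row["user"], row["amount"])) \
               .reduceByKey(lambda a, b: a + b)
print(totals.take(5))

spark.stop()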