Learning Spark ch01 - Introduction to Data Analysis with Spark
References to Spark Courses
Course : Introduction to Big Data with Apache Spark : http://ouo.io/Mqc8L5
Course : Spark Fundamentals I : http://ouo.io/eiuoV
Course : Functional Programming Principles in Scala : http://ouo.io/rh4vv
1. Chapter 01: Introduction to Data Analysis with Spark
Learning Spark
by Holden Karau et al.
2. Overview: Introduction to Data Analysis with Spark
What Is Apache Spark?
A Unified Stack
Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX
Cluster Managers
Who Uses Spark, and for What?
Data Science Tasks
Data Processing Applications
A Brief History of Spark
Spark Versions and Releases
Storage Layers for Spark
3. 1.1 What Is Apache Spark?
Apache Spark is a cluster computing platform designed to be fast and general purpose
Spark extends the MapReduce model to support different computations:
batch applications,
iterative algorithms,
interactive queries,
and streaming
Runs computations in memory
Highly accessible
simple APIs in Python, Java, Scala, and SQL (a minimal sketch follows after this slide)
rich built-in libraries; integrates with Hadoop clusters and data sources
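As a rough illustration of these simple APIs, here is a minimal PySpark sketch of an in-memory batch computation (a word count). It is only a sketch: it assumes a local Spark installation, and "logs.txt" is a hypothetical input file.

# Minimal PySpark word count; "logs.txt" is a placeholder input path.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("WordCount")
sc = SparkContext(conf=conf)

lines = sc.textFile("logs.txt")                    # RDD of lines from a text file
words = lines.flatMap(lambda line: line.split())   # split each line into words
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

print(counts.take(10))                             # bring a few results back to the driver
sc.stop()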
4. edX and Coursera Courses
Introduction to Big Data with Apache Spark
Spark Fundamentals I
Functional Programming Principles in Scala
6. 1.2.1 A Unified Stack: Core, SQL, Streaming
Spark Core
Task Scheduling
Memory management
Fault recovery
Storage system interaction
API that defines resilient distributed datasets (RDDs)
Spark SQL
Provides a SQL interface to Spark
Allows programmatic data manipulation mixed with SQL (see the sketch after this slide)
Spark Streaming
Enables processing of live streams of data, e.g. web logs
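The sketch below illustrates the "programmatic manipulation mixed with SQL" point above. It assumes the SparkSession entry point from newer Spark releases (the edition of the book these slides follow uses SQLContext/HiveContext), and "people.json" with fields "name" and "age" is a hypothetical input file.

# Mixing a SQL query with DataFrame operations in Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlExample").getOrCreate()

people = spark.read.json("people.json")        # load JSON into a DataFrame
people.createOrReplaceTempView("people")       # register it as a SQL table

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")   # SQL query
adults.filter(adults["age"] < 65).show()       # continue programmatically

spark.stop()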
7. 1.2.2 A Unified Stack: MLlib, GraphX, Cluster Managers
MLlib
Contains common machine learning (ML) functionality (a toy example follows after this slide)
Classification, Regression, Clustering, Collaborative Filtering
Model evaluation, Data Import, Lower-level ML primitives
GraphX
Extends the Spark RDD API, just like Spark SQL and Spark Streaming
Contains graph algorithms
Cluster Managers
Hadoop YARN, Apache Mesos
Default: Standalone scheduler
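The toy example below shows the kind of ML functionality MLlib provides, using the DataFrame-based spark.ml API from newer Spark releases (the RDD-based pyspark.mllib API is similar in spirit). The inline dataset and column names are illustrative only.

# Training and applying a logistic regression classifier with spark.ml.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

# (label, feature vector) pairs -- toy data
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1, 0.1])),
    (1.0, Vectors.dense([2.0, 1.0, -1.0])),
    (1.0, Vectors.dense([2.0, 1.3, 1.0])),
], ["label", "features"])

model = LogisticRegression(maxIter=10).fit(train)     # fit the classifier
model.transform(train).select("label", "prediction").show()

spark.stop()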
8. 1.3 Who Uses Spark, and for What?
General-purpose framework for cluster computing
Data Scientists
Engineers
Data Scientists
Analyze and model data
SQL, statistics, predictive modeling (ML) using Python or R
Use cases: interactive shells with Python/Scala and Spark SQL, using MLlib and calling out to Matlab/R (a shell sketch follows after this slide)
Engineers
Data Processing Applications
Principles of software engineering (encapsulation, OOP, interface design)
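A hedged sketch of the interactive-shell use case: the lines below would be typed at the PySpark shell prompt (started with bin/pyspark from a Spark distribution), which pre-creates the SparkContext as sc; "events.txt" is a hypothetical dataset.

# Interactive exploration in the PySpark shell (sc is provided by the shell).
lines = sc.textFile("events.txt")
errors = lines.filter(lambda line: "ERROR" in line)   # keep only error lines
print(errors.count())                                 # trigger the computation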
9. 1.4 A Brief History of Spark
2009: UC Berkeley RAD Lab, which later became the AMPLab
Started because Hadoop MapReduce was inefficient for iterative and interactive
computing jobs; Spark was designed for interactive and iterative query
performance
In-memory storage
Efficient fault recovery; 10-20x faster than MapReduce
Early Adopters
Spark PoweredBy page
Spark Meetups
Spark Summit
2011
Berkeley Data Analytics Stack (BDAS)
10. 1.5 Spark Versions and Releases
May 2014: Spark 1.0.0
April 2015: Spark 1.3.1
Spark Documentation
11. 1.6 Storage Layers for Spark
Spark can create distributed datasets from the following (see the sketch after this list)
HDFS
Any other storage system supported by the Hadoop API, including
Local Filesystem
Amazon S3
Cassandra
Hive
HBase, etc.
Supported file formats
Text file
Sequence file
Avro
Parquet
Hadoop InputFormat
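The sketch below shows how the storage layers and file formats listed above are reached through the same APIs. All paths are placeholders, and HDFS/S3 access additionally requires the appropriate Hadoop configuration, connector JARs, and credentials.

# Creating datasets from different storage layers and file formats (placeholder paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageExample").getOrCreate()
sc = spark.sparkContext

local_rdd = sc.textFile("file:///tmp/data.txt")         # local filesystem
hdfs_rdd = sc.textFile("hdfs://namenode:8020/logs/")    # HDFS
s3_rdd = sc.textFile("s3a://my-bucket/events/")         # Amazon S3 via the s3a connector

parquet_df = spark.read.parquet("hdfs://namenode:8020/tables/events.parquet")  # Parquet
parquet_df.show(5)

spark.stop()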
Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster.
First, all libraries and higher-level components in the stack benefit from improvements at the lower layers.
Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one.
Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models.
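As a rough illustration of combining processing models in one application, the sketch below runs a Spark SQL query and feeds its result straight into RDD-style core operations, with no export/import step in between. File, table, and column names are placeholders.

# Combining Spark SQL with core RDD operations in a single application.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CombinedExample").getOrCreate()

df = spark.read.json("purchases.json")         # hypothetical input file
df.createOrReplaceTempView("purchases")

top = spark.sql("SELECT user, amount FROM purchases WHERE amount > 100")

# Continue with the RDD API on the same data.
totals = top.rdd.map(lambda row: (row["user"], row["amount"])) \
               .reduceByKey(lambda a, b: a + b)
print(totals.take(5))

spark.stop()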