Learning Spark Ch01 - Introduction to Data Analysis with Spark

References to Spark Courses

Course: Introduction to Big Data with Apache Spark: http://ouo.io/Mqc8L5
Course: Spark Fundamentals I: http://ouo.io/eiuoV
Course: Functional Programming Principles in Scala: http://ouo.io/rh4vv


  1. CHAPTER 01: INTRODUCTION TO DATA ANALYSIS WITH SPARK
     Learning Spark by Holden Karau et al.
  2. Overview: Introduction to Data Analysis with Spark
     • What Is Apache Spark?
     • A Unified Stack
       • Spark Core
       • Spark SQL
       • Spark Streaming
       • MLlib
       • GraphX
       • Cluster Managers
     • Who Uses Spark, and for What?
       • Data Science Tasks
       • Data Processing Applications
     • A Brief History of Spark
     • Spark Versions and Releases
     • Storage Layers for Spark
  3. 1.1 What Is Apache Spark?
     • Apache Spark is a cluster computing platform
     • Spark extends the MapReduce model to support
       • different kinds of computations: batch applications, iterative algorithms, interactive queries, and streaming
       • running computations in memory
     • Highly accessible
       • simple APIs in Python, Java, Scala, and SQL
       • rich built-in libraries for accessing Hadoop clusters and data sources
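Since the slide above highlights Spark's simple APIs and in-memory computation, here is a minimal, hedged PySpark sketch of both: a word count whose result RDD is cached so a second action reuses the in-memory data. The master URL and input file name are placeholders, not taken from the slides.

    # Minimal PySpark word count; "local[*]" and "README.md" are placeholders.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("WordCount")
    sc = SparkContext(conf=conf)

    lines = sc.textFile("README.md")                      # distributed dataset from a text file
    words = lines.flatMap(lambda line: line.split())      # split each line into words
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    counts.cache()                                        # keep the result in memory for reuse
    print(counts.take(5))                                 # first five (word, count) pairs
    print(counts.count())                                 # second action reuses the cached RDD
    sc.stop()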
  4. edX and Coursera Courses
     • Introduction to Big Data with Apache Spark
     • Spark Fundamentals I
     • Functional Programming Principles in Scala
  5. 1.2 A Unified Stack
  6. 1.2.1 A Unified Stack: Core, SQL, Streaming
     • Spark Core
       • Task scheduling
       • Memory management
       • Fault recovery
       • Storage system interaction
       • The API that defines Resilient Distributed Datasets (RDDs)
     • Spark SQL
       • Provides a SQL interface to Spark
       • Allows programmatic data manipulation to be mixed with SQL
     • Spark Streaming
       • Enables processing of live streams of data, e.g. web logs
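To make the Spark SQL and Spark Streaming bullets concrete, a small sketch against the Spark 1.x Python APIs follows; the sample rows, host name, and port are invented for illustration only.

    # Spark SQL: mix programmatic data manipulation with SQL (Spark 1.3-era API).
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext("local[*]", "UnifiedStackDemo")
    sqlContext = SQLContext(sc)

    people = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=17)])  # invented rows
    df = sqlContext.createDataFrame(people)
    df.registerTempTable("people")                      # expose the DataFrame to SQL queries
    sqlContext.sql("SELECT name FROM people WHERE age >= 18").show()

    # Spark Streaming: filter a live stream of log lines (e.g. web logs).
    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 1)                       # 1-second micro-batches
    logs = ssc.socketTextStream("localhost", 9999)      # hypothetical log source
    logs.filter(lambda line: "ERROR" in line).pprint()  # print matching records each batch
    # ssc.start(); ssc.awaitTermination()               # uncomment to actually run the stream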
  7. 1.2.2 A Unified Stack: MLlib, GraphX, Cluster Managers
     • MLlib
       • Contains common machine learning (ML) modules
       • Classification, regression, clustering, collaborative filtering
       • Model evaluation, data import, lower-level ML primitives
     • GraphX
       • Extends the Spark RDD API, just like Spark SQL and Spark Streaming
       • Contains common graph algorithms
     • Cluster Managers
       • Hadoop YARN, Apache Mesos
       • Default: the Standalone Scheduler
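As a sketch of the MLlib bullet, here is a toy classifier trained with the RDD-based spark.mllib API of the Spark 1.x era; the four training points are invented, and the local master URL is a placeholder.

    # MLlib sketch: train a toy logistic regression model (spark.mllib, Spark 1.x API).
    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    sc = SparkContext("local[*]", "MLlibDemo")

    # Invented toy data: class 1.0 near [1, 0], class 0.0 near [0, 1].
    training = sc.parallelize([
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(1.0, [0.9, 0.1]),
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(0.0, [0.1, 0.9]),
    ])
    model = LogisticRegressionWithSGD.train(training, iterations=20)
    print(model.predict([1.0, 0.0]))   # expected to print 1
    sc.stop()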
  8. 1.3 Who Uses Spark, and for What?
     • A general-purpose framework for cluster computing, used by two groups:
       • data scientists
       • engineers
     • Data scientists
       • Analyze and model data
       • SQL, statistics, predictive modeling (ML) using Python or R
       • Use cases: interactive shells with Python, Scala, and Spark SQL, supported by MLlib libraries and call-outs to MATLAB/R
     • Engineers
       • Data processing applications
       • Principles of software engineering (encapsulation, OOP, interface design)
  9. 1.4 A Brief History of Spark
     • 2009: the UC Berkeley RAD Lab became the AMPLab
       • Started because Hadoop MapReduce was inefficient for interactive computing jobs
       • Spark was designed for interactive and iterative query performance
         • In-memory storage
         • Efficient fault recovery: 10-20x faster than MapReduce
     • Early adopters
       • Spark PoweredBy page
       • Spark Meetups
       • Spark Summit
     • 2011: Berkeley Data Analytics Stack (BDAS)
  10. 1.5 Spark Versions and Releases
      • May 2014: Spark 1.0.0
      • April 2015: Spark 1.3.1
      • Spark documentation
  11. 1.6 Storage Layers for Spark
      • Spark can create distributed datasets from
        • HDFS
        • other storage systems supported by the Hadoop API
          • local filesystem
          • Amazon S3
          • Cassandra
          • Hive
          • HBase, etc.
      • Supports other formats
        • text files
        • SequenceFiles
        • Avro
        • Parquet
        • any Hadoop InputFormat
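The storage layers above map onto the URI schemes accepted by SparkContext's input methods; in the hedged sketch below every path, host name, and bucket is a placeholder.

    # Reading from different storage layers; all paths, hosts, and buckets are placeholders.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "StorageLayersDemo")

    local_rdd = sc.textFile("file:///tmp/data.txt")                   # local filesystem
    hdfs_rdd  = sc.textFile("hdfs://namenode:8020/logs/*.log")        # HDFS
    s3_rdd    = sc.textFile("s3n://my-bucket/events/")                # Amazon S3 via Hadoop's s3n
    seq_rdd   = sc.sequenceFile("hdfs://namenode:8020/data/pairs")    # Hadoop SequenceFile
    print(local_rdd.count())
    sc.stop()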
  12. Learn More about Apache Spark
