CHAPTER 01: INTRODUCTION TO DATA ANALYSIS WITH SPARK
Learning Spark
by Holden Karau et al.
Overview: Introduction to Data Analysis with Spark
• What Is Apache Spark?
• A Unified Stack
  • Spark Core
  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX
  • Cluster Managers
• Who Uses Spark, and for What?
  • Data Science Tasks
  • Data Processing Applications
• A Brief History of Spark
• Spark Versions and Releases
• Storage Layers for Spark
1.1 What Is Apache Spark?
• Apache Spark is a cluster computing platform
• Spark extends the MapReduce model to support
  • More types of computation
    • batch applications,
    • iterative algorithms,
    • interactive queries,
    • and streaming
  • Running computations in memory
• Highly accessible
  • Simple APIs in Python, Java, Scala, and SQL
  • Rich built-in libraries
  • Can run in Hadoop clusters and access Hadoop data sources
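
To make the MapReduce model mentioned on this slide concrete, here is a minimal single-machine sketch in plain Python. It is not Spark code and uses no Spark API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions. It only shows the three stages that a framework such as MapReduce or Spark distributes across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine all values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    """Run the three stages in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark extends mapreduce", "spark runs in memory"]
    print(word_count(sample))
```

In a real cluster each stage would run in parallel on many machines. Because Spark can keep intermediate results in memory between stages, iterative and interactive workloads can avoid rereading data from disk on every pass.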
edX and Coursera Courses
• Introduction to Big Data with Apache Spark
• Spark Fundamentals I
• Functional Programming Principles in Scala
1.2 A Unified Stack
1.2.1 A Unified Stack: Core, SQL, Streaming
• Spark Core
  • Task scheduling
  • Memory management
  • Fault recovery
  • Interaction with storage systems
  • API that defines resilient distributed datasets (RDDs)
• Spark SQL
  • Provides a SQL interface to Spark
  • Allows programmatic data manipulation to be mixed with SQL queries
• Spark Streaming
  • Enables processing of live streams of data, e.g. web logs
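
The RDD idea behind Spark Core is that a dataset is defined by the chain of transformations that produces it, so lost in-memory data can be recomputed from that lineage. The toy sketch below illustrates this in plain Python. `ToyRDD` and its methods are hypothetical names chosen for illustration. The class runs on one machine and is not Spark's API.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, transforms=()):
        self._source = source          # callable that re-reads the base data
        self._transforms = transforms  # lineage: ordered (kind, fn) steps
        self._cache = None             # in-memory result; may be lost

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return ToyRDD(self._source, self._transforms + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._transforms + (("filter", fn),))

    def _compute(self):
        # Replay the lineage from the base data.
        data = iter(self._source())
        for kind, fn in self._transforms:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def collect(self):
        # Action: compute once, then serve later calls from memory.
        if self._cache is None:
            self._cache = self._compute()
        return list(self._cache)

    def lose_cache(self):
        # Simulate a failure that discards the in-memory copy.
        self._cache = None


logs = ToyRDD(lambda: ["INFO start", "ERROR disk full",
                       "INFO done", "ERROR timeout"])
errors = (logs.filter(lambda line: line.startswith("ERROR"))
              .map(lambda line: line.split(" ", 1)[1]))

first = errors.collect()
errors.lose_cache()          # the cached result is gone...
second = errors.collect()    # ...but the lineage rebuilds it
```

Here `first` and `second` are equal even though the cached result was discarded in between, which is the essence of lineage-based fault recovery. Real RDDs are additionally partitioned across worker machines.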
1.2.2 A Unified Stack: MLlib, GraphX, Cluster Managers
• MLlib
  • Contains common machine learning (ML) functionality
    • Classification, regression, clustering, collaborative filtering
    • Model evaluation, data import, lower-level ML primitives
• GraphX
  • Extends the Spark RDD API, just as Spark SQL and Spark Streaming do
  • Contains common graph algorithms
• Cluster Managers
  • Hadoop YARN, Apache Mesos
  • Default: the Standalone Scheduler included with Spark
1.3 Who Uses Spark, and for What?
• General-purpose framework for cluster computing, used by
  • Data scientists
  • Engineers
• Data scientists
  • Analyze and model data
  • SQL, statistics, predictive modeling (ML) using Python, R
  • Use cases: interactive shells for Python and Scala, Spark SQL, and MLlib, with support for calling out to external programs in MATLAB or R
• Engineers
  • Build data processing applications
  • Apply principles of software engineering (encapsulation, OOP, interface design)
1.4 A Brief History of Spark
• 2009: Spark started in the UC Berkeley RAD Lab, which later became the AMPLab
  • Motivation: Hadoop MapReduce was inefficient for iterative and interactive computing jobs
  • Designed for fast interactive queries and iterative algorithms
    • In-memory storage
    • Efficient fault recovery
  • 10-20x faster than MapReduce for certain jobs
• Early adopters
  • Spark PoweredBy page
  • Spark Meetups
  • Spark Summit
• 2011
  • Berkeley Data Analytics Stack (BDAS)
1.5 Spark Versions and Releases
• May 2014: Spark 1.0.0
• April 2015: Spark 1.3.1
• Spark Documentation
1.6 Storage Layers for Spark
• Spark can create distributed datasets from
  • HDFS
  • Other storage systems supported by the Hadoop APIs
    • Local filesystem
    • Amazon S3
    • Cassandra
    • Hive
    • HBase, etc.
• Supported file formats
  • Text files
  • SequenceFiles
  • Avro
  • Parquet
  • Any other Hadoop InputFormat
Learn More about Apache Spark


Editor's Notes

  1. Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Tight integration of the components in the stack has several benefits. First, all libraries and higher-level components in the stack benefit from improvements at the lower layers. Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one. Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models.