2. What is Spark?
• In-Memory Map/Reduce Engine
• Spark was developed in 2009 by the Berkeley AMPLab
• Converted to an Apache project in 2013
• Scala based
• Scala, Java, and Python APIs
3. Most Active Big Data Project within Apache
Data from Spark Summit 2014
5. Spark vs. Hadoop
• Hadoop Map/Reduce limitations
  • High latency
  • No in-memory caching
  • Map/Reduce code is complicated to write
• Spark
  • In-memory processing
  • Simple, expressive API
  • Can run standalone, even on Windows
  • Up to 100x faster in memory and 10x faster on disk
7. Spark Word Count Example
val file = spark.textFile("file.name")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
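For intuition about what the pipeline computes, here is a hedged sketch of the same logic on a local Scala collection, with no Spark required: `flatMap` and `map` are the standard collection methods, and `groupBy` plus a sum plays the role of `reduceByKey` (the object name and sample lines are illustrative only):

```scala
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or", "not to be")
    val counts = lines
      .flatMap(line => line.split(" "))  // split each line into words
      .map(word => (word, 1))            // pair each word with a count of 1
      .groupBy(_._1)                     // local stand-in for reduceByKey's grouping
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the 1s per word
    println(counts("to")) // "to" appears twice across the two lines
  }
}
```

In real Spark the grouping and summing happen in a distributed shuffle, but the per-word result is the same.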
8. RDD – Resilient Distributed Dataset
• Operations
  • Transformations
  • Actions
• Persistence
  • Allows an RDD to persist between operations
  • Provides the ability to write to disk if too large for memory
• Parallelized Collections
  • Typically you want 2-4 slices per CPU in your cluster
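The key split above is that transformations are lazy and actions force evaluation. A hedged local analogy using a plain Scala view (illustrative only, not the Spark API; the counter and object name are made up for the demo):

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluated = 0
    // A view is lazy, like an RDD transformation: mapping does no work yet.
    val squares = (1 to 4).view.map { x => evaluated += 1; x * x }
    println(evaluated)      // nothing evaluated so far
    val total = squares.sum // like an action: forces the computation
    println(evaluated)      // all four elements were now evaluated
    println(total)          // 1 + 4 + 9 + 16
  }
}
```

In Spark the same pattern means a chain of transformations costs nothing until an action such as `reduce` or `count` triggers the job.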
11. Persistence
• Store an RDD for later operations
• Each node persists a partition
• Partitions are fault-tolerant
• persist() or cache()
12. Persistence storage levels
• MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM
• MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk
• MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition)
• MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk
• DISK_ONLY - Store the RDD partitions only on disk
• MEMORY_ONLY_2, MEMORY_AND_DISK_2 - Same as the levels above, but replicate each partition on two cluster nodes
• OFF_HEAP - Store RDD in serialized format in Tachyon
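A minimal sketch of choosing a level, assuming `spark` is the SparkContext handle from the word-count slide and reusing its `"file.name"` placeholder; this needs a running Spark deployment, so it is illustrative rather than standalone:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext named `spark`, as in the word-count example.
val words = spark.textFile("file.name").flatMap(line => line.split(" "))

// A storage level can only be assigned once per RDD:
words.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions that don't fit in memory to disk
// words.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)

words.count() // first action computes the RDD and persists its partitions
words.count() // later actions reuse the persisted partitions
```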
13. Spark Advantages
• Same code can be used for streaming and batch processing
• In-memory processing
• Fault-tolerant RDD persistence
• Machine learning library (MLlib) built in
• Spark SQL (coming soon)
• Graph processing (GraphX, Bagel/Pregel)
14. Spark Drawbacks
• No append for output
• Lack of a job scheduler
• Spark on YARN not quite ready for prime time
• Still a young project