2. What is Apache Spark?
Fast and general computing engine for clusters
Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop MapReduce for some apps
3. About Databricks
Founded by creators of Spark in 2013
Offers a hosted cloud service built on Spark
• Interactive workspace with notebooks, dashboards, jobs
5. Spark Programming Model
Write programs in terms of transformations on
distributed datasets
Resilient Distributed Datasets (RDDs)
• Collections of objects stored in memory or disk across a cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure
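The transformation model above can be sketched with plain Python lists; this is an illustration of the map/filter style only, not the Spark API (the helper names parallel_map and parallel_filter are made up), and it runs locally with no cluster.

```python
# Local sketch of the RDD programming model: build new datasets by
# applying transformations to existing ones. Hypothetical helpers,
# not Spark's API -- in Spark these would run in parallel on workers.

def parallel_map(f, data):
    # In Spark, map runs per-partition across the cluster;
    # here we simply apply f to every element locally.
    return [f(x) for x in data]

def parallel_filter(pred, data):
    # Keep only the elements matching the predicate.
    return [x for x in data if pred(x)]

nums = [1, 2, 3, 4, 5]
evens = parallel_filter(lambda n: n % 2 == 0, nums)
squares = parallel_map(lambda n: n * n, evens)
print(squares)  # [4, 16]
```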
6. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "MySQL" in s).count()
messages.filter(lambda s: "Redis" in s).count()
. . .

lines is the base RDD; errors and messages are transformed RDDs; count() is an action that triggers computation.

[Diagram: the driver ships tasks to three workers; each worker reads an HDFS block (Block 1-3), builds an in-memory cache of messages (Cache 1-3), and returns results to the driver.]

Result: full-text search of Wikipedia in 0.5 sec (vs 20 s for on-disk data)
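The pipeline above can be simulated in memory with plain Python; the sample log lines and the tab-separated format (message in the third field) are assumptions for illustration, not data from the talk.

```python
# Local, in-memory sketch of the log-mining example: filter error
# lines, extract the message field, then count matches per pattern.
log_lines = [
    "INFO\t2015-06-01\tserver started",
    "ERROR\t2015-06-01\tMySQL connection refused",
    "ERROR\t2015-06-02\tRedis timeout",
    "ERROR\t2015-06-02\tMySQL deadlock detected",
]

errors = [s for s in log_lines if s.startswith("ERROR")]
messages = [s.split('\t')[2] for s in errors]  # third field = message text

mysql_count = sum(1 for s in messages if "MySQL" in s)
redis_count = sum(1 for s in messages if "Redis" in s)
print(mysql_count, redis_count)  # 2 1
```

In real Spark the messages dataset would be cached across the cluster, so each subsequent pattern search reads from memory rather than re-scanning HDFS.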
10. Higher-Level Libraries
// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
12. Spark Community
Over 1000 production users, clusters up to 8000 nodes
Many talks online at spark-summit.org
14. Ongoing Work
Speeding up Spark through code generation and
binary processing (Project Tungsten)
R interface to Spark (SparkR)
Real-time machine learning library
Frontend and backend work in Databricks
(visualization, collaboration, auto-scaling, …)
Add “variables” to the “functions” in functional programming
100 GB of data on 50 m1.xlarge EC2 machines
Alibaba, Tencent
At Berkeley, we have been working on a solution since 2009. This solution consists of a software stack for data analytics, called the Berkeley Data Analytics Stack. The centerpiece of this stack is Spark.
Spark has seen significant adoption, with hundreds of companies using it, of which around sixteen have contributed code back. In addition, Spark has been deployed on clusters exceeding 1,000 nodes.