apache spark for everyone
amcasari + deb siegel
WWConnect 2016, Seattle
+
who: @amcasari
@dsiegel
what: #WWConnect2016
where: @galvanizeSEA
why: @ApacheSpark
(now we can be found)
COORDINATES
@
@
you
don’t worry about this….
you
because we feel the same way…
we are all learning!
we might be a wee bit ambitious…
+
8 hour workshop: intro to spark
8 hour workshop: cluster computing apps
tutorials
+
banging head against keyboard
personal projects
today
} professional experience
{now you are safe take a nap….}
https://github.com/morningc/wwconnect-2016-spark4everyone
courtesy YouTube
why do we care about spark?
what we data people do all day:
• lots of data collection, curation + storage
• lots and lots of data engineering
• product development with machine learning algorithms!
data science: ideation -> exploration -> experimentation -> productization
+
what is spark?
• “fast and general-purpose cluster computing system”
• advanced cyclic data flow + in-memory computing: runs 10x-100x faster than Hadoop MR
• interactive shells in several languages (incl. SQL)
• performant + scalable
+
courtesy databricks
WARNING: THINGS CHANGE IN SPARK ALL THE TIME. SOME THINGS MIGHT BE HIDDEN, NO LONGER ACCESSIBLE. LIKE SPARK.UNICORNS()
courtesy databricks
what is spark?
+
• multi-language APIs give many different users the ability to work with Spark
• a gateway into Spark, but you must still run Spark!
• current languages supported (with various levels of depth): Scala, Python, Java, R
• moving beyond the shell + text editor
+
what is spark?
courtesy databricks
• Data Sources API provides a “pluggable mechanism for accessing structured data through Spark SQL”
• changes the question from “where to store the data” to “how can we access + work w/ the data”
• not exposed to end users; intended for Spark devs
• supplemented by spark-packages
+
what is spark?
courtesy databricks
• Spark SQL allows you to query structured data in Spark programs using either SQL or the DataFrames API
• can be used in applications + iterative workflows from a shell or notebook
• the DataFrames API is conceptually similar to a table in a relational database or a data frame in R/Python
• preserves the schema of the original data for many file formats, including Parquet
• highly optimized, distributed collection of data
• Datasets: experimental interface (Scala + Java)
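a rough PySpark sketch of the DataFrames + SQL ideas above (the same flow works in Scala and R). people.json and its columns are made up for illustration; sqlContext is the SQLContext the Spark 1.x shells create for you:

# read structured data: the schema is inferred and preserved
df = sqlContext.read.json("people.json")
df.printSchema()
df.filter(df["age"] > 21).show()                 # DataFrames API
df.registerTempTable("people")
sqlContext.sql("SELECT name, count(*) AS n FROM people GROUP BY name").show()   # plain SQL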
+
what is spark?
courtesy databricks
• Spark Streaming allows for discretized event-stream processing
• why would we be excited about it?
• single platform/analysis pipeline for batch + real-time analysis
• on-line learning for ML applications
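a minimal Spark Streaming sketch in PySpark: a word count over 10-second micro-batches. It assumes text is arriving on a local socket (e.g. from nc -lk 9999) and reuses the sc from the shell:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                        # 10-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))      # same code shape as a batch word count
counts.pprint()                                       # print each batch's counts
ssc.start()
ssc.awaitTermination()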
+
what is spark?
courtesy databricks
MLlib + GraphX on Apache Spark
• Classification and regression: linear models (SVMs, logistic regression, linear regression), naive Bayes, decision trees, ensembles of trees (Random Forests and Gradient-Boosted Trees), isotonic regression
• Collaborative filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), streaming k-means
• Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
• Optimization (developer): stochastic gradient descent, limited-memory BFGS (L-BFGS)
• Graph analytics
• how can we continue to approach every data science product with scale + performance as top priority?
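to make one of those concrete, a tiny k-means sketch using the RDD-based MLlib API in PySpark (the points are toy data, just for illustration):

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)   # trains in parallel across partitions
print(model.clusterCenters)
print(model.predict([0.5, 0.5]))                      # cluster index for a new point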
+
what is spark?
v1.3 -> v1.6
courtesy databricks
• spark-packages is a hosted module resource center for packages developed by the Spark community
• extends functionality + integration options for current Spark releases
• examples: spark-csv, spark-testing-base
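one hedged example of pulling in spark-csv: pass the package coordinate when you launch the shell (match the Scala version to your Spark build), then read through the Data Sources API. flights.csv and its header are hypothetical:

$ pyspark --packages com.databricks:spark-csv_2.10:1.4.0

df = (sqlContext.read
      .format("com.databricks.spark.csv")    # the format spark-csv registers
      .option("header", "true")
      .option("inferSchema", "true")
      .load("flights.csv"))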
+
what is spark?
courtesy databricks
n.b.> it will not solve every problem for everyone
• not an all-in-one cluster management + admin tool; utilizes other resource managers (YARN, Mesos, Amazon EC2)
• quickly changing updates (major release every 3 months) sometimes require additional work for backwards compatibility
• for small and medium sized data: not necessary for performant analysis, data science + ML apps
• learning curve is broad for designing cluster applications @ scale
+
what is spark?
how does spark work?
• basic abstraction: Resilient Distributed Dataset (RDD)
• items distributed across many compute nodes that can be manipulated in parallel
• we primarily use core functionality to quickly build applications for production, including data processing for the DataScience Web API
• UDFs expand core functionality
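what that looks like in the pyspark shell, where sc (the SparkContext) is already created for you:

nums = sc.parallelize(range(1, 1001), 4)      # an RDD split across 4 partitions
evens = nums.filter(lambda n: n % 2 == 0)     # transformation: nothing runs yet
squares = evens.map(lambda n: n * n)          # another transformation
print(squares.count())                        # action: kicks off the parallel job
print(squares.take(5))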
+
courtesy databricks
+
Hadoop MR Spark
how does spark work?
spark != mapreduce
read more here from @pacoid
+
transformation, actions, laziness, DAGs
how does spark work?
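a quick way to see the laziness: transformations only record lineage (the DAG); an action makes Spark actually run it. README.md here is just the file that ships in the Spark directory:

words = sc.textFile("README.md").flatMap(lambda line: line.split())   # no job yet
pairs = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)   # still no job
print(pairs.toDebugString())                  # the lineage/DAG Spark has recorded so far
print(pairs.takeOrdered(10, key=lambda kv: -kv[1]))                   # action: now it runs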
how does spark work?
• fault-tolerant cluster computing framework for most clusters
+
how does spark work on a cluster?
+
tasks, jobs, stages, oh my!
courtesy databricks
http://localhost:4040/jobs/
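roughly: each action becomes a job, shuffles split a job into stages, and each stage runs as many tasks as there are partitions. A small sketch you can watch in the UI at localhost:4040 while it runs (access.log is a hypothetical input file):

lines = sc.textFile("access.log")
counts = (lines.map(lambda line: (line.split(" ")[0], 1))
               .reduceByKey(lambda a, b: a + b))   # shuffle boundary -> a second stage
counts.collect()                                   # the action that kicks off the job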
spark + notebooks (today)
+
Jupyter: R, Scala, Python, Java
Zeppelin: Scala, Python, Java, SQL
RStudio: R
Databricks: Scala, Python, Java, SQL, R
• Scala via Databricks Community Edition
• Python via Jupyter
• R via RStudio
• Java via well, we can’t fit in everything….
why notebooks?
+
Moving down the data pipeline from raw to results.
How best to quickly move through the pipeline to:
1. Show value of work
2. Communicate results
3. Move models into production
pipeline = problem formulation -> tool chain construction:
use lots of different data sources -> clean up your data (ETL) -> run sophisticated analytics -> discover insights! -> integrate results into pipelines -> communicate results and value
…NOTEBOOKS
deb siegel
scala + databricks
community edition
@
scala for spark on two slides!
• Spark is written in Scala: no extra language API libraries or wrappers needed
• Runs on the Java Virtual Machine; interoperates with Java and compiles to JVM bytecode
• Can be used in a REPL/shell and in notebooks (e.g. Jupyter, Zeppelin, DBCE)
+
scala for spark on two slides!
• Integrated object-oriented and functional programming
• Data is generally immutable: great for the Spark programming model!
• Anonymous functions
• Statically typed: types are checked at compile-time
• Type inference
• Implicit type conversion: if a method is not found for your object, it will be converted to an object which does have that method
• Case class: a serializable object with overridden methods such as equals(), hashCode() and toString()
+
We have some DBCE accounts to give away.
Most of the code can otherwise be used on Apache Zeppelin.
scala + databricks
community edition
demo
amanda casari
pyspark + python +
jupyter
@
what is python?
+
• google: “python is a high-level general-purpose programming language”
• python.org: “Python is a programming language that lets you work more quickly and integrate your systems more effectively.”
• most important: python is open source!
• python is friendly for beginners: programmers + not-yet-programmers
what is pyspark?
+
• pyspark is the Spark Python programming API: you can talk to spark using python
• the spark python API docs are fairly well maintained
• you can work on pyspark objects using pyspark or on python objects using python
• some pyspark functions return python objects
• very popular, well developed on + optimized for cluster performance
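a small illustration of the pyspark-object vs python-object point:

rdd = sc.parallelize(["spark", "for", "everyone"])   # RDD: a pyspark object, distributed
lengths = rdd.map(len)                               # still an RDD
local = lengths.collect()                            # collect() returns a plain python list
print(type(local), sum(local))                       # from here on it's just python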
pyspark + jupyter
+
• easy python distribution: anaconda (installs + manages python packages + soo much more)
• creating a PySpark kernel for Jupyter notebook
• running spark from any directory: create a symlink
• starting pyspark + Jupyter from the command line:
$ IPYTHON_OPTS="notebook" pyspark
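once the notebook is wired up, a quick sanity check in a fresh cell (assumes the kernel hands you a SparkContext named sc):

print(sc.version)                                # should print your Spark version, e.g. 1.6.x
print(sc.parallelize(range(100)).sum())          # tiny job to prove everything is connected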
pyspark + python +
jupyter demo
deb siegel
sparkR + RStudio
@
what is R?
+
• R is a language & environment for statistical computing and graphics
• many specialized packages available in CRAN (the Comprehensive R Archive Network)
• RStudio is an open source R notebook / IDE
what is sparkR?
+
• the Spark DataFrame is the main data structure (you can still use an R data.frame, but it is not Sparky)
• Translated to RDD[Row] behind the scenes; columns have names and a data type
• All local R functions and libraries are available (locally, on the driver only for now)
• Stage of maturity = evolving: use it for filtering, selecting, and aggregating your BIG data down to small or medium data which you can work with locally
• Some machine learning algorithms and stats are available on the BIG data; more will be added as the SparkR project grows
• Can work in REPL/shell or Notebook (e.g. RStudio, Jupyter, DBCE, Zeppelin)
what is sparkR?
+
sparkR + RStudio
+
> Sys.setenv(SPARK_HOME="/yourpathto/spark")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init(master="local[2]", appName="SparkR-for-everyone",
+        sparkPackages="com.databricks:spark-csv_2.11:1.2.0")
> sqlContext <- sparkRSQL.init(sc)
sparkR + RStudio
demo
now your turn…workshoppy bit!
https://github.com/morningc/wwconnect-2016-spark4everyone
Our possibly overly ambitious goals:
- Get you set up w/ Spark + a notebook
- Show you where to find help
- Get you started on a notebook
/all the spark for windows
/all the spark for mac os x
/all the spark for linux
Spark runs on Java 7+, Python 2.6+ and R 3.1+, Scala 2.10.x
courtesy xkcd
you are not alone…
+
courtesy xkcd