Sparklyr is an R package that lets you analyze data in Spark while using familiar tools in R. Sparklyr provides a complete backend for dplyr, a popular package for working with data frame objects both in memory and out of memory; through sparklyr, your dplyr code is translated into Spark SQL. Sparklyr also supports MLlib, so you can run classifiers, regressions, clustering, decision trees, and many other machine learning algorithms on your distributed data in Spark. With sparklyr you can analyze amounts of data that would not traditionally fit into R's memory, then collect results from Spark back into R for further visualization and documentation.
Sparklyr is also extensible. You can create R packages that depend on sparklyr to call the full Spark API. One example of an extension is H2O's rsparkling, an R package that works with H2O's machine learning algorithms. With sparklyr and rsparkling you have access to all of H2O's tools for analysis with R and Spark.
In this presentation I will demonstrate how to analyze data in Spark by using sparklyr and rsparkling.
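To make that workflow concrete, here is a minimal sketch of the round trip described above. The local master and the built-in iris data set are illustrative choices, not part of the talk (note that sdf_copy_to replaces dots in column names with underscores):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (a cluster master URL works the same way)
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark
iris_tbl <- sdf_copy_to(sc, iris, "iris", overwrite = TRUE)

# dplyr verbs run as Spark SQL on the cluster ...
summary_tbl <- iris_tbl %>%
  group_by(Species) %>%
  summarise(avg_petal_width = mean(Petal_Width))

# ... and collect() brings the (small) result back into R
local_df <- collect(summary_tbl)

spark_disconnect(sc)
```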
R and Spark: How to Analyze Data Using RStudio's Sparklyr and H2O's Rsparkling Packages: Spark Summit East talk by Nathan Stephens
1. ANALYZE DATA USING RSTUDIO'S SPARKLYR
R AND SPARK
https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast
2. Apache Spark
• Huge investments in big data and Hadoop
• Data scientists wanting to analyze data at scale
• Rapid progress and adoption in Spark libraries
R and RStudio
• Wide range of tools and packages
• Powerful ways to share insights
• Interactive notebooks
• Great visualizations
What we hear from our customers
3. Best of both worlds
If you are investing in Spark,
then there is nothing
stopping you from using it
with the full power of R
Using R with Spark
4. Benefits of Spark for the R user
Apache Spark…
• Can integrate with Hadoop
• Supports familiar SQL syntax
• Has built-in machine learning
• Is designed for performance
• Great for interactive data analysis
R users can take advantage
of all these investments
5. New! Open-source
R package from RStudio
• Integrated with the RStudio IDE
• Sparklyr is a dplyr back-end for Spark
• Extensible foundation for Spark
applications and R
sparklyr
http://spark.rstudio.com/
6. Create your own R
packages with
interfaces to Spark
• Interfaces to custom
machine learning pipelines
• Interfaces to 3rd party
Spark packages
• Many other R interfaces
sparklyr extensions
Example
Count the number of lines in a file
Extension
library(sparklyr)
count_lines <- function(sc, file) {
  spark_context(sc) %>%
    invoke("textFile", file, 1L) %>%
    invoke("count")
}
Call
sc <- spark_connect(master = "local")
count_lines(sc, "hdfs://path/data.csv")
7. R for data science toolchain
“You’ll learn how to get your data into R
[with Spark], get it into the most useful
structure, transform it, visualize it and
model it [with Spark].”
8. Import
Create a connection
sc <- spark_connect()
Import data from file/S3/HDFS/R
spark_read_csv(sc, "table", "hdfs://<path>")
sdf_copy_to(sc, table, "table")
nyct2010_tbl <- tbl(sc, "table")
Write data
spark_write_parquet(table, "hdfs://<path>")
Sparklyr
Connect to Spark.
Read and write data in
CSV, JSON, and Parquet
formats.
Data can be stored in
HDFS, S3, or on the
local filesystem.
9. Wrangle
dplyr
my_tbl %>%
  filter(Petal_Width < 0.3) %>%
  select(Petal_Length, Petal_Width)
Spark SQL
SELECT Petal_Length, Petal_Width
FROM mytable
WHERE Petal_Width < 0.3
Use dplyr to write
Spark SQL
A fast, consistent tool
for working with data
frame-like objects,
both in memory and
out of memory.
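The translation shown on this slide can be inspected directly: dplyr (via the dbplyr backend) exposes the generated SQL through show_query(). A minimal sketch, assuming the iris data has already been copied into Spark:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, "iris", overwrite = TRUE)

# Print the Spark SQL generated for this pipeline, roughly equivalent to:
#   SELECT Petal_Length, Petal_Width FROM iris WHERE Petal_Width < 0.3
iris_tbl %>%
  filter(Petal_Width < 0.3) %>%
  select(Petal_Length, Petal_Width) %>%
  show_query()
```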
11. Model
MLlib models
K-means
Linear regression
Logistic regression
Survival regression
Generalized linear regression
Decision trees
Random forests
Gradient boosted trees
Principal component analysis
Naive Bayes
Multilayer perceptron
Latent Dirichlet allocation
One vs rest
R models (CRAN task views)
Industry specific:
Chemometrics
ClinicalTrials
Econometrics
Environmetrics
Finance
Genetics
Pharmacokinetics
Phylogenetics
Psychometrics
Social Sciences
General topics:
Machine Learning
Bayesian
Cluster
Design of experiments
ExtremeValue
Meta Analysis
Multivariate
NLP
Robust methods
Spatial
Survival
Time Series
Graphical models
Other models (e.g. H2O)
GLMNet
Bayesian regression
Multinomial regression
Random Forest
Gradient boosted machine
Decision trees
Multi-Layer Perceptron
Auto-encoder
Restricted Boltzmann
K-Means
LSH
SVD
ALS
ARIMA
Forecasting
Collaborative filtering
Solvers and optimization
MLlib: No data movement required. Native ML algorithms. Fast growing ecosystem.
R: Over 10,000 packages. Time tested, industry specific models. Integrated with other R packages.
Others: Scalable, high performance models. Wide variety of algorithms. Useful diagnostics.
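As a sketch of how sparklyr exposes these MLlib algorithms, here is a hypothetical K-means fit on an iris table already copied into Spark (column names reflect sdf_copy_to's renaming of dots to underscores; the data and cluster count are illustrative):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, "iris", overwrite = TRUE)

# Fit K-means with 3 clusters on the petal measurements;
# the model is trained in Spark, only the fitted summary returns to R
kmeans_model <- iris_tbl %>%
  select(Petal_Length, Petal_Width) %>%
  ml_kmeans(centers = 3)

print(kmeans_model)
```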
13. Demo
Analyzing 1 billion records with Spark and R
http://colorado.rstudio.com:3939/content/262/
https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast
14. rsparkling extension
Spark is extensible…
sparklyr is extensible
https://github.com/h2oai/rsparkling/blob/master/R/h2o_context.R#L53
[Stack diagram: R connects to Spark through sparklyr and to H2O through the h2o package; rsparkling bridges sparklyr and H2O's Sparkling Water, which runs H2O on Spark.]
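A minimal sketch of the rsparkling workflow, based on the package's documented h2o_context() and as_h2o_frame() interface (the data set and model choice here are illustrative):

```r
library(sparklyr)
library(rsparkling)  # the Sparkling Water bridge
library(h2o)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, "iris", overwrite = TRUE)

# Convert the Spark DataFrame to an H2O frame without collecting into R
iris_hf <- as_h2o_frame(sc, iris_tbl)

# Fit an H2O model on data that stays distributed
fit <- h2o.kmeans(training_frame = iris_hf,
                  x = c("Petal_Length", "Petal_Width"),
                  k = 3)
```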
15. Benefits and limitations: where should I model my data?
MLlib
Benefits: No data movement required. Native ML algorithms. Fast growing ecosystem.
Limitations: Comparatively fewer algorithms and fewer diagnostics.
Others (e.g. H2O)
Benefits: Scalable, high performance models. Wide variety of algorithms. Useful diagnostics.
Limitations: Data conversion requires 3-4X memory. Added complexity around introducing and learning another tool.
R
Benefits: Access to CRAN packages, visualization, reporting tools, and time tested algorithms.
Limitations: Data collection is expensive and collection size is limited (< 10 GB).