useR! tutorial, part 1: Introduction to SparkR

Databricks

Presentation given at useR 2016 at http://user2016.org/tutorials/11.html

Introduction to SparkR
Shivaram Venkataraman, Hossein Falaki

Big Data & R
DataFrames
Visualization
Libraries
Data+

Big Data & R: Challenges
Data access
HDFS, Hive
Capacity
Single machine memory
Parallelism
Single thread

Apache Spark
Engine for large-scale data processing
Fast, Easy to Use
Runs Everywhere
EC2, clusters, laptop etc.

Speed
Scalable
Flexible
Statistics
Visualization
DataFrames
SparkR

Big Data & R: Patterns
Big Data
Small Learning
Partition
Aggregate
Large Scale
Machine Learning

1. Big Data, Small Learning
Data
Cleaning
Filtering
Aggregation
Collect
Subset
DataFrames
Visualization
Libraries

1. Big Data, Small Learning
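# load a JSON file as a SparkR DataFrame
songs <- read.df("songs.json", "json")
# keep only songs released after 2000
newSongs <- filter(songs, songs$year > 2000)
# collect() brings the filtered subset into a local data.frame for plotting
ggplot(collect(newSongs))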
Data
Cleaning
Filtering
Aggregation
Collect
Subset
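
The pattern on this slide can be sketched end to end as follows. This is a minimal sketch, not the tutorial's own notebook: it assumes a local Spark installation with the SparkR package available, and the `local_songs` data, its column names and the `local[1]` master are hypothetical stand-ins for `songs.json` and a real cluster.

```r
library(SparkR)

# Assumption: a local Spark install; master and appName are illustrative.
sparkR.session(master = "local[1]", appName = "big-data-small-learning")

# Hypothetical toy data standing in for songs.json
local_songs <- data.frame(
  title = c("a", "b", "c", "d", "e"),
  year  = c(1995, 2001, 2001, 2008, 1999),
  stringsAsFactors = FALSE
)
songs <- createDataFrame(local_songs)

# Filtering and aggregation run in Spark
newSongs <- filter(songs, songs$year > 2000)
per_year <- summarize(groupBy(newSongs, newSongs$year),
                      count = n(newSongs$year))

# Only the small aggregated result is collected into a local data.frame
small <- collect(arrange(per_year, per_year$year))
print(small)

sparkR.session.stop()
```

The collected `small` is an ordinary R data.frame, so it can be handed to ggplot or any other local library.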

2. Partition Aggregate
Data
Params
Best Model
Parameter Tuning

2. Partition Aggregate
library(MASS)  # provides lm.ridge
params <- c(1e-3, 1e-1, 1e2)
data <- read.csv("t.csv")
train <- function(prm) {
  # ridge regression with penalty lambda = prm
  lm.ridge(y ~ x + z, data = data, lambda = prm)
}
lapply(params, train)
Data
Params
Best Model
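
The "Best Model" step is not shown on the slide. One possible way to complete it on a single machine is sketched below, assuming synthetic data in place of `t.csv` and using the generalized cross-validation score that `lm.ridge` reports as the selection criterion; both choices are assumptions made for illustration.

```r
library(MASS)  # provides lm.ridge

# Hypothetical synthetic data standing in for t.csv
set.seed(1)
n <- 100
data <- data.frame(x = rnorm(n), z = rnorm(n))
data$y <- 2 * data$x - data$z + rnorm(n, sd = 0.5)

params <- c(1e-3, 1e-1, 1e2)

# Partition: fit one ridge model per candidate penalty
train <- function(prm) {
  lm.ridge(y ~ x + z, data = data, lambda = prm)
}
models <- lapply(params, train)

# Aggregate: compare the GCV score of each fit and keep the lowest
gcv <- sapply(models, function(m) unname(m$GCV))
best <- models[[which.min(gcv)]]
print(best$lambda)
```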

3. Large Scale Machine Learning
Data Featurize Learning Model

3. Large Scale Machine Learning
Data Featurize Learning Model
training <- read.csv("t.csv")
# fit a Gaussian GLM on the data frame loaded above
model <- glm(
  delay ~ Distance + Dest,
  family = "gaussian",
  data = training)
summary(model)

Big Data & R
Big Data
Small Learning
Partition
Aggregate
Large Scale
Machine Learning
SparkR:
Unified approach

SparkR DataFrames
people <- read.df("people.json", "json")
# average of the age column, computed by Spark
avgAge <- select(people, avg(people$age))
head(avgAge)
Number of data sources
Column Functions, SQL
Support for R UDFs
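
The three features listed above can be sketched together. This is a minimal sketch assuming a local Spark installation (2.0 or later) with SparkR; the `people` rows, the view name and the `age_next` column are hypothetical.

```r
library(SparkR)

# Assumption: a local Spark install; master and appName are illustrative.
sparkR.session(master = "local[1]", appName = "sparkr-dataframes")

# Hypothetical toy data standing in for people.json
people <- createDataFrame(data.frame(
  name = c("Ann", "Bob", "Cy"),
  age  = c(30, 25, 41),
  stringsAsFactors = FALSE
))

# Column function: average age
avgAge <- collect(select(people, avg(people$age)))

# SQL: the same query against a temporary view
createOrReplaceTempView(people, "people")
avgAgeSql <- collect(sql("SELECT avg(age) AS avg_age FROM people"))

# R UDF: dapply runs an R function on each partition;
# the output schema has to be declared up front
schema <- structType(
  structField("name", "string"),
  structField("age", "double"),
  structField("age_next", "double")
)
older <- collect(dapply(people, function(pdf) {
  cbind(pdf, age_next = pdf$age + 1)
}, schema))

sparkR.session.stop()
```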

Large Scale Machine Learning
Integration with MLlib
Key Features
R-like formulas
Model statistics
model <- glm(
a ~ b + c,
data = df)
summary(model)
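
A runnable variant of the snippet above might look like this. It is a sketch under assumptions: a local Spark installation with SparkR, and R's built-in `mtcars` data with the formula `mpg ~ wt + hp` standing in for the slide's `df` and `a ~ b + c`.

```r
library(SparkR)

# Assumption: a local Spark install; master and appName are illustrative.
sparkR.session(master = "local[1]", appName = "sparkr-glm")

# Hypothetical stand-in for df: the built-in mtcars data as a SparkR DataFrame
cars <- createDataFrame(mtcars)

# R-like formula; the model is fitted by Spark
model <- glm(mpg ~ wt + hp, data = cars, family = "gaussian")

# Model statistics
s <- summary(model)
print(s$coefficients)

# Predictions come back as a SparkR DataFrame with a "prediction" column
preds <- collect(select(predict(model, cars), "prediction"))

sparkR.session.stop()
```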

Partition Aggregate
spark.lapply: Simple, parallel API
Ex: Parameter tuning, Model Averaging
Include existing R packages
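
Parameter tuning with `spark.lapply` can be sketched as below. The sketch assumes a local Spark installation (2.0 or later) with SparkR and the MASS package available on the workers; the synthetic data and the use of the GCV score to pick the penalty are illustrative assumptions.

```r
library(SparkR)

# Assumption: a local Spark install; master and appName are illustrative.
sparkR.session(master = "local[2]", appName = "partition-aggregate")

params <- c(1e-3, 1e-1, 1e2)

# Runs on a worker: existing R packages are loaded inside the function
train <- function(prm) {
  library(MASS)
  # Hypothetical synthetic data, regenerated identically on every worker
  set.seed(1)
  d <- data.frame(x = rnorm(100), z = rnorm(100))
  d$y <- 2 * d$x - d$z + rnorm(100, sd = 0.5)
  fit <- lm.ridge(y ~ x + z, data = d, lambda = prm)
  list(lambda = prm, gcv = unname(fit$GCV))
}

# One task per parameter; the results come back as a local R list
results <- spark.lapply(params, train)

# Aggregate on the driver: keep the penalty with the lowest GCV score
gcv <- sapply(results, function(r) r$gcv)
best_lambda <- results[[which.min(gcv)]]$lambda
print(best_lambda)

sparkR.session.stop()
```

The call has the same shape as the single-machine `lapply(params, train)`, so the function body does not need to change to run in parallel.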

SparkR Status
Open source -- Part of Apache Spark
> 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx etc.
Contributions welcome!

Tutorial Outline
Part 1: Data Exploration
• ETL: Data loading, schema
• Exploration: Filter, clean, aggregate etc.
• Visualization: Integration with ggplot
Part 2: Advanced Analytics (After the break)

Tutorial Setup
Each user gets a dedicated micro cluster
• Cluster is terminated after 1 hour of inactivity
• Multiple users can collaborate on a notebook
Notebooks can be exported/imported
Examples and tutorials in R/Python/Scala
Free online service for learning Apache Spark

Tutorial Setup
Databricks Notebooks
• Interactive workspace
• Markdown + R, Python, Scala, SQL
Sign up at http://databricks.com/ce

Tutorial Setup
Fill out our survey at
tiny.cc/sparkr-user-survey

SparkR
Big data processing from R
DataFrames for ETL, data exploration
Support for advanced analytics

Tutorial Next Steps
Sign up at http://databricks.com/ce
Part 1: tiny.cc/sparkr-tutorial-part1
Fill out our survey at tiny.cc/sparkr-user-survey
