Unified Big Data Processing 
with Apache Spark 
Matei Zaharia 
@matei_zaharia
What is Apache Spark? 
Fast & general engine for big data processing 
Generalizes MapReduce model to support more types 
of processing 
Most active open source project in big data
About Databricks 
Founded by the creators of Spark in 2013 
Continues to drive open source Spark development, 
and offers a cloud service (Databricks Cloud) 
Partners to support Spark with Cloudera, MapR, 
Hortonworks, Datastax
Spark Community 
[Charts: commits and lines of code changed, activity in past 6 months, for MapReduce, YARN, HDFS, Storm, and Spark]
Community Growth 
[Chart: contributors per month to Spark, 2010 to 2014]
2-3x more activity than Hadoop, Storm, 
MongoDB, NumPy, D3, Julia, …
Overview 
Why a unified engine? 
Spark execution model 
Why was Spark so general? 
What’s next
History: Cluster Programming Models 
2004
MapReduce 
A general engine for batch processing
Beyond MapReduce 
MapReduce was great for batch processing, but users 
quickly needed to do more: 
> More complex, multi-pass algorithms 
> More interactive ad-hoc queries 
> More real-time stream processing 
Result: many specialized systems for these workloads
Big Data Systems Today 
General batch processing: MapReduce
Specialized systems for new workloads: Pregel, Giraph, Presto, Storm, Dremel, Drill, Impala, S4, . . .
Problems with Specialized Systems 
More systems to manage, tune, deploy 
Can’t combine processing types in one application 
> Even though many pipelines need to do this! 
> E.g. load data with SQL, then run machine learning 
In many pipelines, data exchange between 
engines is the dominant cost!
Big Data Systems Today
General batch processing: MapReduce
Specialized systems for new workloads: Pregel, Giraph, Presto, Storm, Dremel, Drill, Impala, S4, . . .
Unified engine: ?
Background 
Recall 3 workloads were issues for MapReduce: 
> More complex, multi-pass algorithms 
> More interactive ad-hoc queries 
> More real-time stream processing 
While these look different, all 3 need one thing that 
MapReduce lacks: efficient data sharing
Data Sharing in MapReduce 
[Diagram: iterative jobs do an HDFS read and an HDFS write around every iteration (iter. 1, iter. 2, . . .); interactive queries (query 1, query 2, query 3, . . .) each do an HDFS read of the input to produce their result]
Slow due to data replication and disk I/O
What We’d Like 
[Diagram: after one-time processing, the input is held in distributed memory and shared across iterations (iter. 1, iter. 2, . . .) and across queries (query 1, query 2, query 3, . . .)]
10-100× faster than network and disk
Spark Model 
Resilient Distributed Datasets (RDDs) 
> Collections of objects that can be stored in memory or 
disk across a cluster 
> Built via parallel transformations (map, filter, …) 
> Fault-tolerant without replication
Example: Log Mining 
Load error messages from a log into memory, then 
interactively search for various patterns 
lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()             # action
messages.filter(lambda s: "bar" in s).count()
. . .
[Diagram: the driver sends tasks to three workers and gets results back; each worker reads one block (Block 1 to 3) and keeps its part of the data in a cache (Cache 1 to 3)]
Full-text search of Wikipedia in <1 sec 
(vs 20 sec for on-disk data)
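The pattern on this slide (lazy transformations, then cache(), then several actions that reuse the cached data) can be imitated on one machine. The sketch below is a toy stand-in written in plain Python; ToyRDD, read_log, and the sample log lines are hypothetical names made up for illustration and are not part of Spark's API.

```python
class ToyRDD:
    """Single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning an iterator
        self._want_cache = False     # set by cache(); nothing is stored yet
        self._cached = None          # filled in by the first action after cache()

    def filter(self, pred):
        # Transformations only build a new ToyRDD; no data is read here.
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)))

    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._iterate()))

    def cache(self):
        self._want_cache = True
        return self

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._want_cache:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def count(self):
        # An action: this is what finally pulls data through the pipeline.
        return sum(1 for _ in self._iterate())


reads = {"n": 0}  # how many times the "log file" has been scanned

LOG = [
    "ERROR\t10:01\tfoo failed",
    "INFO\t10:02\tall good",
    "ERROR\t10:03\tbar timed out",
    "ERROR\t10:04\tfoo and bar failed",
]

def read_log():
    reads["n"] += 1
    return iter(LOG)

lines = ToyRDD(read_log)
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

foo_count = messages.filter(lambda s: "foo" in s).count()
bar_count = messages.filter(lambda s: "bar" in s).count()
```

In this toy, the log is scanned once even though two queries run, because the second query is served from the cached messages. Real Spark additionally spreads the cached partitions over the workers' memory.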
Fault Tolerance 
RDDs track lineage info to rebuild lost data 
(file.map(lambda rec: (rec.type, 1))
     .reduceByKey(lambda x, y: x + y)
     .filter(lambda kv: kv[1] > 10))   # kv is a (type, count) pair
[Diagram: input file -> map -> reduce -> filter]
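A rough way to picture lineage: keep the recipe (the input plus the ordered list of transformations) rather than a second copy of the data, and replay the recipe when a result is lost. The sketch below is a simplified single-machine model; the Lineage class, reduce_by_key helper, and sample records are hypothetical. It replays the whole dataset, whereas Spark only needs to rebuild the lost partitions.

```python
from collections import defaultdict


class Lineage:
    """Records how a dataset was derived so it can be rebuilt (toy model)."""

    def __init__(self, source, steps=()):
        self.source = source          # function returning the input records
        self.steps = tuple(steps)     # ordered (name, function) transformations

    def then(self, name, fn):
        return Lineage(self.source, self.steps + ((name, fn),))

    def compute(self):
        data = list(self.source())
        for _, fn in self.steps:
            data = fn(data)
        return data


def reduce_by_key(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


records = ["a"] * 12 + ["b"] * 3 + ["c"] * 11   # made-up record types

result = (
    Lineage(lambda: records)
    .then("map", lambda recs: [(r, 1) for r in recs])
    .then("reduce", reduce_by_key)
    .then("filter", lambda kvs: [kv for kv in kvs if kv[1] > 10])
)

materialized = result.compute()
materialized = None            # simulate losing the in-memory copy
rebuilt = result.compute()     # replay map -> reduce -> filter from the input
step_names = [name for name, _ in result.steps]
```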
Example: Logistic Regression 
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30)]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, later iterations 1 s
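As a back-of-the-envelope reading of these figures, total running time can be modelled from the per-iteration costs on the slide. The sketch assumes the quoted numbers are exact and constant per iteration, which the chart only approximates; the function names are made up for illustration.

```python
def hadoop_time(iterations, per_iteration=110):
    """Total seconds if every iteration costs the same (re-reads the input)."""
    return per_iteration * iterations


def spark_time(iterations, first=80, later=1):
    """Total seconds for iterations >= 1: one slow load, then cached passes."""
    return first + later * (iterations - 1)


# (iterations, Hadoop seconds, Spark seconds) for the x-axis values on the chart
table = [(n, hadoop_time(n), spark_time(n)) for n in (1, 5, 10, 20, 30)]
```

Under these assumptions, 30 iterations come to 3300 s on Hadoop and 109 s on Spark.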
Spark in Scala and Java 
// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()
// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();
How General Is It?
Libraries Built on Spark
Spark Streaming: real-time
Spark SQL: relational
MLlib: machine learning
GraphX: graph
All built on Spark Core
Spark SQL 
Represents tables as RDDs 
Tables = Schema + Data = SchemaRDD 
From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
From JSON:
c.jsonFile("tweets.json").registerTempTable("tweets")
c.sql("select text, user.name from tweets")
tweets.json:
{"text": "hi",
 "user": {
   "name": "matei",
   "id": 123
 }}
Spark Streaming 
[Diagram: the input stream is cut along the time axis into a sequence of RDDs]
Represents streams as a series of RDDs over time 
val spammers = sc.sequenceFile("hdfs://spammers.seq")
sc.twitterStream(...)
  .filter(t => t.text.contains("QCon"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()
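The idea of a stream as a series of small batches can be sketched without Spark: group timestamped events into fixed intervals, then run the same batch logic on each group. Everything below (micro_batches, the sample tweets, the spammers set) is hypothetical illustration data. Unlike a real streaming system, this toy skips intervals that have no events.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive batches of `interval` seconds."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]


spammers = {"eve"}   # stands in for the static spammers dataset

tweets = [
    (0.2, ("bob", "QCon rocks")),
    (0.7, ("eve", "QCon spam")),
    (1.1, ("eve", "buy now")),
    (1.5, ("amy", "at QCon")),
]

# Per batch: keep tweets mentioning QCon, then "join" against the spammer set.
flagged_per_batch = [
    [(user, text) for user, text in batch if "QCon" in text and user in spammers]
    for batch in micro_batches(tweets, interval=1.0)
]
```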
MLlib 
Vectors, Matrices = RDD[Vector] 
Iterative computation 
points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10) 
model.predict(newPoint)
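To show why k-means counts as iterative computation, here is a plain-Python sketch of Lloyd's algorithm: every pass rereads the same points, which is the access pattern that benefits from keeping data in memory. This is not MLlib's implementation; the naive "first k points" initialization, the fixed iteration count, and the sample points are assumptions chosen to keep the example deterministic.

```python
def closest(point, centers):
    """Index of the center nearest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans(points, k, iterations=10):
    centers = [points[i] for i in range(k)]      # naive init: first k points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:                         # full pass over the same data
            groups[closest(p, centers)].append(p)
        centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(points, 2)
cluster_of_new_point = closest((9.0, 9.0), centers)
```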
GraphX 
Represents graphs as RDDs of edges and vertices
Combining Processing Types 
// Load data using SQL
val points = ctx.sql(
  "select latitude, longitude from historic_tweets")
// Train a machine learning model
val model = KMeans.train(points, 10)
// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
Composing Workloads 
Separate systems: HDFS read -> ETL -> HDFS write -> HDFS read -> train -> HDFS write -> HDFS read -> query -> HDFS write -> . . .
Spark: HDFS read -> ETL -> train -> query -> HDFS write
Performance vs Specialized Systems
[Chart, SQL: response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem)]
[Chart, ML: response time (min) for Mahout, GraphLab, Spark]
[Chart, Streaming: throughput (MB/s/node) for Storm, Spark]
On-Disk Performance: Petabyte Sort 
Spark beat last year’s Sort Benchmark winner, Hadoop, 
by 3× using 10× fewer machines 
            2013 Record (Hadoop)   Spark 100 TB   Spark 1 PB
Data Size   102.5 TB               100 TB         1000 TB
Time        72 min                 23 min         234 min
Nodes       2100                   206            190
Cores       50400                  6592           6080
Rate/Node   0.67 GB/min            20.7 GB/min    22.5 GB/min
tinyurl.com/spark-sort
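The headline claims can be checked against the table with simple arithmetic. The sketch below recomputes the speedup, the machine ratio, and the per-node rate from the Data Size, Time, and Nodes rows, assuming 1 TB = 1000 GB. The times in the table are rounded to whole minutes, so the recomputed rates are expected to land near the Rate/Node row rather than match it exactly.

```python
runs = {
    "2013 Record (Hadoop)": {"tb": 102.5, "minutes": 72, "nodes": 2100},
    "Spark 100 TB": {"tb": 100, "minutes": 23, "nodes": 206},
    "Spark 1 PB": {"tb": 1000, "minutes": 234, "nodes": 190},
}


def rate_per_node(run):
    """GB sorted per minute per node, taking 1 TB = 1000 GB (an assumption)."""
    return run["tb"] * 1000 / run["minutes"] / run["nodes"]


hadoop = runs["2013 Record (Hadoop)"]
spark = runs["Spark 100 TB"]

speedup = hadoop["minutes"] / spark["minutes"]   # how many times faster
node_ratio = hadoop["nodes"] / spark["nodes"]    # how many times fewer machines
rates = {name: rate_per_node(run) for name, run in runs.items()}
```

With the rounded inputs, the speedup comes out slightly above 3 and the node ratio slightly above 10, in line with the "3× using 10× fewer machines" summary.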
Why was Spark so General? 
In a world of growing data complexity, understanding 
this can help us design new tools / pipelines 
Two perspectives: 
> Expressiveness perspective 
> Systems perspective
1. Expressiveness Perspective 
Spark ≈ MapReduce + fast data sharing
MapReduce can emulate any distributed system!
[Diagram: one MR step = local computation followed by all-to-all communication, repeated for each step]
How to share data quickly across steps? Spark: RDDs
How low is this latency? Spark: ~100 ms
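The two phases named on this slide, local computation and all-to-all communication, can be sketched as one function. This is a minimal single-process model, not Hadoop's or Spark's implementation; mr_step and the word-count data are hypothetical. Which reducer a key lands on depends on Python's string hashing and can differ between runs, but the merged result does not.

```python
from collections import defaultdict


def mr_step(partitions, map_fn, reduce_fn, num_reducers):
    """One MapReduce step over a list of input partitions."""
    # Local computation: each partition is mapped independently.
    mapped = [[kv for rec in part for kv in map_fn(rec)] for part in partitions]

    # All-to-all communication: any mapper may send to any reducer.
    shuffled = [defaultdict(list) for _ in range(num_reducers)]
    for part in mapped:
        for key, value in part:
            shuffled[hash(key) % num_reducers][key].append(value)

    # Local computation again: each reducer folds the values of its keys.
    return [
        [(key, reduce_fn(values)) for key, values in sorted(bucket.items())]
        for bucket in shuffled
    ]


partitions = [["a b", "a"], ["b c", "a"]]   # two input partitions of text lines

output = mr_step(
    partitions,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=sum,
    num_reducers=2,
)
word_counts = dict(kv for part in output for kv in part)
```

A multi-step algorithm would feed output back in as the next step's partitions. How cheaply that hand-off happens is the data-sharing question the slide raises.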
2. Systems Perspective 
Main bottlenecks in clusters are network and I/O 
Any system that lets apps control these resources can 
match speed of specialized ones 
In Spark: 
> Users control data partitioning & caching 
> We implement the data structures and algorithms of 
specialized systems within Spark records
Examples 
Spark SQL 
> A SchemaRDD holds records for each chunk of data 
(multiple rows), with columnar compression 
GraphX 
> GraphX represents graphs as an RDD of HashMaps so 
that it can join quickly against each partition
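The GraphX point can be pictured as follows: if each partition is a hash map and the partitioning function is known, a join only has to probe the one partition that could hold a given key. The sketch is a single-machine illustration under that assumption, not GraphX's actual data structures; the function names, integer vertex IDs, and sample values are made up.

```python
def partition_by_key(pairs, num_partitions):
    """Hash-partition (int key, value) pairs into one dict per partition."""
    parts = [{} for _ in range(num_partitions)]
    for key, value in pairs:
        parts[key % num_partitions][key] = value
    return parts


def join_partitioned(left_parts, right_pairs):
    """Join right_pairs against the partitioned left side by direct lookup."""
    n = len(left_parts)
    out = []
    for key, value in right_pairs:
        part = left_parts[key % n]      # only this partition can hold the key
        if key in part:                 # hash-map probe, no scan
            out.append((key, (part[key], value)))
    return out


vertices = partition_by_key([(1, "a"), (2, "b"), (3, "c"), (4, "d")], 2)
messages = [(3, 0.5), (1, 0.25), (9, 1.0)]   # (vertex id, message); 9 has no vertex
joined = join_partitioned(vertices, messages)
```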
Result 
Spark can leverage most of the latest innovations in 
databases, graph processing, machine learning, … 
Users get a single API that composes very efficiently 
More info: tinyurl.com/matei-thesis
What’s Next for Spark 
While Spark has been around since 2009, many pieces 
are just beginning 
300 contributors, 2 whole libraries new this year 
Big features in the works
Spark 1.2 (Coming in Dec) 
New machine learning pipelines API 
> Featurization & parameter search, similar to SciKit-Learn 
Python API for Spark Streaming 
Spark SQL pluggable data sources 
> Hive, JSON, Parquet, Cassandra, ORC, … 
Scala 2.11 support
Beyond Hadoop 
[Diagram: your application here, on top of batch, interactive, and streaming workloads, running over Hadoop, Cassandra, Mesos, public clouds, …]
Unified API across workloads, storage systems 
and environments
Learn More 
Downloads and tutorials: spark.apache.org 
Training: databricks.com/training (free videos) 
Databricks Cloud: databricks.com/cloud
www.spark-summit.org
Unified Big Data Processing with Apache Spark

More Related Content

What's hot

Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptxWasm1953
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPDatabricks
 
Modernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesModernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesCarole Gunst
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKai Wähner
 
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...Michael Rys
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)James Serra
 

What's hot (20)

Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Microsoft Fabric.pptx
Microsoft Fabric.pptxMicrosoft Fabric.pptx
Microsoft Fabric.pptx
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Modernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesModernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data Pipelines
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
 
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 

Viewers also liked

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your EyesDemi Ben-Ari
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in SparkSnappyData
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingDemi Ben-Ari
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
AWSクラウド利用料算出の参考資料
AWSクラウド利用料算出の参考資料AWSクラウド利用料算出の参考資料
AWSクラウド利用料算出の参考資料SORACOM, INC
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosPaco Nathan
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerSachin Aggarwal
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?Paco Nathan
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch ProcessingEdureka!
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 
Big data processing with apache spark
Big data processing with apache sparkBig data processing with apache spark
Big data processing with apache sparksarith divakar
 
AWS Simple Monthly Calculator 操作説明書
AWS Simple Monthly Calculator 操作説明書AWS Simple Monthly Calculator 操作説明書
AWS Simple Monthly Calculator 操作説明書Amazon Web Services Japan
 

Viewers also liked (20)

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
AWSクラウド利用料算出の参考資料
AWSクラウド利用料算出の参考資料AWSクラウド利用料算出の参考資料
AWSクラウド利用料算出の参考資料
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache Mesos
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Big data processing with apache spark
Big data processing with apache sparkBig data processing with apache spark
Big data processing with apache spark
 
AWS Simple Monthly Calculator 操作説明書
AWS Simple Monthly Calculator 操作説明書AWS Simple Monthly Calculator 操作説明書
AWS Simple Monthly Calculator 操作説明書
 

Similar to Unified Big Data Processing with Apache Spark

Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"IT Event
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 

Similar to Unified Big Data Processing with Apache Spark (20)

Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
Unified Big Data Processing with Apache Spark

  • 2. Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia
  • 3. What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing Most active open source project in big data
  • 4. About Databricks Founded by the creators of Spark in 2013 Continues to drive open source Spark development, and offers a cloud service (Databricks Cloud) Partners to support Spark with Cloudera, MapR, Hortonworks, Datastax
  • 5. Spark Community [Charts: commits and lines of code changed for MapReduce, YARN, HDFS, Storm, and Spark; activity in past 6 months]
  • 6. Community Growth [Chart: contributors per month to Spark, 2010 to 2014] 2-3x more activity than Hadoop, Storm, MongoDB, NumPy, D3, Julia, …
  • 7. Overview Why a unified engine? Spark execution model Why was Spark so general? What’s next
  • 9. MapReduce A general engine for batch processing
  • 10. Beyond MapReduce MapReduce was great for batch processing, but users quickly needed to do more: > More complex, multi-pass algorithms > More interactive ad-hoc queries > More real-time stream processing Result: many specialized systems for these workloads
  • 11. Big Data Systems Today General batch processing: MapReduce. Specialized systems for new workloads: Pregel, Giraph, Presto, Storm, Dremel, Drill, Impala, S4, . . .
  • 12. Problems with Specialized Systems More systems to manage, tune, deploy Can’t combine processing types in one application > Even though many pipelines need to do this! > E.g. load data with SQL, then run machine learning In many pipelines, data exchange between engines is the dominant cost!
  • 13. Big Data Systems Today General batch processing: MapReduce. Specialized systems for new workloads: Pregel, Giraph, Presto, Storm, Dremel, Drill, Impala, S4, . . . Unified engine: ?
  • 14. Overview Why a unified engine? Spark execution model Why was Spark so general? What’s next
  • 15. Background Recall 3 workloads were issues for MapReduce: > More complex, multi-pass algorithms > More interactive ad-hoc queries > More real-time stream processing While these look different, all 3 need one thing that MapReduce lacks: efficient data sharing
  • 16. Data Sharing in MapReduce [Diagram: each iteration reads its input from HDFS and writes its result back to HDFS; each query re-reads the input from HDFS] Slow due to data replication and disk I/O
  • 17. What We’d Like [Diagram: after one-time processing of the input, iterations and queries share data through distributed memory] 10-100× faster than network and disk
  • 18. Spark Model Resilient Distributed Datasets (RDDs) > Collections of objects that can be stored in memory or disk across a cluster > Built via parallel transformations (map, filter, …) > Fault-tolerant without replication
  • 19. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
        lines = spark.textFile("hdfs://...")                     # base RDD
        errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD
        messages = errors.map(lambda s: s.split('\t')[2])
        messages.cache()
        messages.filter(lambda s: "foo" in s).count()            # action
        messages.filter(lambda s: "bar" in s).count()
        . . .
    [Diagram: the driver sends tasks to three workers, each holding one block and its cache, and receives results] Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
  • 20. Fault Tolerance RDDs track lineage info to rebuild lost data
        (file.map(lambda rec: (rec.type, 1))
             .reduceByKey(lambda x, y: x + y)
             .filter(lambda kv: kv[1] > 10))
    [Diagram: Input file → map → reduce → filter]
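The lineage idea on the Fault Tolerance slide can be sketched in plain Python. This is a toy model, not Spark's RDD implementation: the class `ToyDataset` and its `lose_data` method are hypothetical names invented for illustration, and everything runs in one process. Each dataset stores only its parent and the function that derives it, so a lost in-memory copy is recomputed rather than restored from a replica.

```python
# Toy illustration only: this is NOT how Spark implements RDDs.
class ToyDataset:
    """A dataset that remembers how it was derived instead of keeping replicas."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # base data, set only on the root dataset
        self._parent = parent        # dataset this one was derived from
        self._transform = transform  # function mapping parent rows to our rows
        self._cache = None           # materialized rows; may be lost

    def map(self, fn):
        return ToyDataset(parent=self,
                          transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self,
                          transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Recompute from lineage whenever the cached copy is missing.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def lose_data(self):
        """Simulate a node failure that drops the in-memory copy."""
        self._cache = None


records = ToyDataset(source=["a", "b", "a", "c", "a"])
pairs = records.map(lambda t: (t, 1))
only_a = pairs.filter(lambda kv: kv[0] == "a")

before = list(only_a.collect())
only_a.lose_data()
pairs.lose_data()
after = only_a.collect()  # rebuilt by replaying map and filter
```

Nothing is computed until `collect` is called, and the rebuilt result matches the original because the transformations are deterministic.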
  • 22. Example: Logistic Regression [Chart: running time (s) vs number of iterations, 1 to 30] Hadoop: 110 s / iteration. Spark: first iteration 80 s, later iterations 1 s
  • 23. Spark in Scala and Java
        // Scala:
        val lines = sc.textFile(...)
        lines.filter(s => s.contains("ERROR")).count()
        // Java:
        JavaRDD<String> lines = sc.textFile(...);
        lines.filter(s -> s.contains("ERROR")).count();
  • 25. Libraries Built on Spark Spark Core, with four libraries on top: Spark Streaming (real-time), Spark SQL (relational), MLlib (machine learning), GraphX (graph)
  • 26. Spark SQL Represents tables as RDDs Tables = Schema + Data
  • 27. Spark SQL Represents tables as RDDs Tables = Schema + Data = SchemaRDD
    From Hive:
        c = HiveContext(sc)
        rows = c.sql("select text, year from hivetable")
        rows.filter(lambda r: r.year > 2013).collect()
    From JSON (tweets.json holds records such as {"text": "hi", "user": {"name": "matei", "id": 123}}):
        c.jsonFile("tweets.json").registerTempTable("tweets")
        c.sql("select text, user.name from tweets")
  • 29. Spark Streaming Represents streams as a series of RDDs over time [Diagram: a timeline of RDDs]
        val spammers = sc.sequenceFile("hdfs://spammers.seq")
        sc.twitterStream(...)
          .filter(t => t.text.contains("QCon"))
          .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
          .print()
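The "series of RDDs over time" idea can be illustrated with a small pure-Python sketch. It is only an analogy under stated assumptions: `micro_batches` and `count_words` are hypothetical helpers, the events are made-up sample data, and intervals with no events are simply skipped. The point is that a stream is cut into fixed-width batches and the same batch function is applied to each one.

```python
from collections import Counter


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // interval), []).append(value)
    # Return batches in time order; empty intervals are not represented.
    return [batches[k] for k in sorted(batches)]


def count_words(batch):
    """An ordinary batch computation, reused unchanged for every interval."""
    return Counter(word for line in batch for word in line.split())


# Made-up sample events: (seconds since start, text).
events = [(0.2, "spark qcon"), (0.9, "qcon"), (1.4, "spark"), (2.1, "qcon qcon")]
per_batch = [count_words(b) for b in micro_batches(events, interval=1.0)]
```

Because each batch is an ordinary collection, any batch operation (filter, join, aggregate) can be reused on the stream without a separate streaming code path.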
  • 31. MLlib Vectors, Matrices = RDD[Vector] Iterative computation
        points = sc.textFile("data.txt").map(parsePoint)
        model = KMeans.train(points, 10)
        model.predict(newPoint)
  • 32. GraphX Represents graphs as RDDs of edges and vertices
  • 35. Combining Processing Types
        // Load data using SQL
        val points = ctx.sql("select latitude, longitude from historic_tweets")
        // Train a machine learning model
        val model = KMeans.train(points, 10)
        // Apply it to a stream
        sc.twitterStream(...)
          .map(t => (model.closestCenter(t.location), 1))
          .reduceByWindow("5s", _ + _)
  • 36. Composing Workloads [Diagram] Separate systems: each of ETL, train, and query does its own HDFS read and HDFS write. Spark: one HDFS read, then ETL, train, and query, then one HDFS write.
  • 37. Performance vs Specialized Systems [Charts: SQL response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem); ML response time (min) for Mahout, GraphLab, Spark; Streaming throughput (MB/s/node) for Storm, Spark]
  • 38. On-Disk Performance: Petabyte Sort Spark beat last year’s Sort Benchmark winner, Hadoop, by 3× using 10× fewer machines
                    2013 Record (Hadoop)   Spark 100 TB    Spark 1 PB
        Data Size   102.5 TB               100 TB          1000 TB
        Time        72 min                 23 min          234 min
        Nodes       2100                   206             190
        Cores       50400                  6592            6080
        Rate/Node   0.67 GB/min            20.7 GB/min     22.5 GB/min
    tinyurl.com/spark-sort
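The headline ratios on the sort slide can be checked with a few lines of arithmetic. The sketch below uses only the figures printed on the slide and assumes 1 TB = 1000 GB; `rate_per_node` is a helper name chosen for this example. Since the slide shows rounded times, the recomputed per-node rates come out close to, but not exactly equal to, the printed ones.

```python
# Figures copied from the slide; times are the rounded values shown there.
runs = {
    "2013 Record (Hadoop)": {"tb": 102.5, "minutes": 72, "nodes": 2100},
    "Spark 100 TB": {"tb": 100, "minutes": 23, "nodes": 206},
    "Spark 1 PB": {"tb": 1000, "minutes": 234, "nodes": 190},
}


def rate_per_node(run):
    """GB sorted per minute per node, assuming 1 TB = 1000 GB."""
    return run["tb"] * 1000 / run["minutes"] / run["nodes"]


hadoop = runs["2013 Record (Hadoop)"]
spark = runs["Spark 100 TB"]

speedup = hadoop["minutes"] / spark["minutes"]   # slightly above 3
node_ratio = hadoop["nodes"] / spark["nodes"]    # slightly above 10
rates = {name: rate_per_node(run) for name, run in runs.items()}
```

The time ratio and the node ratio are what back the "3× using 10× fewer machines" claim; the per-node rate combines both into a single number.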
  • 39. Overview Why a unified engine? Spark execution model Why was Spark so general? What’s next
  • 40. Why was Spark so General? In a world of growing data complexity, understanding this can help us design new tools / pipelines Two perspectives: > Expressiveness perspective > Systems perspective
  • 41. 1. Expressiveness Perspective Spark ≈ MapReduce + fast data sharing
  • 42. 1. Expressiveness Perspective MapReduce can emulate any distributed system! One MR step = local computation + all-to-all communication. How to share data quickly across steps? Spark: RDDs. How low is this latency? Spark: ~100 ms
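The "local computation, then all-to-all communication" structure of one MR step can be sketched in a single process. This is a minimal illustration, not any framework's API: `map_reduce_step` is a hypothetical function, the partitions are plain lists, and the shuffle is simulated by hashing each key to a reducer bucket.

```python
from collections import defaultdict


def map_reduce_step(partitions, mapper, reducer, num_reducers):
    """Run one map/shuffle/reduce round over in-memory partitions."""
    # Local computation: each partition is mapped independently.
    mapped = [[kv for rec in part for kv in mapper(rec)] for part in partitions]

    # All-to-all communication: every mapper output may go to any reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for part in mapped:
        for key, value in part:
            buckets[hash(key) % num_reducers][key].append(value)

    # Local computation again: each reducer handles only its own keys.
    return [{k: reducer(vs) for k, vs in b.items()} for b in buckets]


partitions = [["a b", "a"], ["b c"]]
out = map_reduce_step(
    partitions,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=sum,
    num_reducers=2,
)
# Merge the per-reducer results; each key lives in exactly one bucket.
counts = {k: v for bucket in out for k, v in bucket.items()}
```

Chaining several such steps is where data sharing matters: here the output of one step is already in memory for the next, which is the property the slide attributes to RDDs.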
  • 43. 2. Systems Perspective Main bottlenecks in clusters are network and I/O Any system that lets apps control these resources can match speed of specialized ones In Spark: > Users control data partitioning & caching > We implement the data structures and algorithms of specialized systems within Spark records
  • 44. Examples Spark SQL > A SchemaRDD holds records for each chunk of data (multiple rows), with columnar compression GraphX > GraphX represents graphs as an RDD of HashMaps so that it can join quickly against each partition
  • 45. Result Spark can leverage most of the latest innovations in databases, graph processing, machine learning, … Users get a single API that composes very efficiently More info: tinyurl.com/matei-thesis
  • 46. Overview Why a unified engine? Spark execution model Why was Spark so general? What’s next
  • 47. What’s Next for Spark While Spark has been around since 2009, many pieces are just beginning 300 contributors, 2 whole libraries new this year Big features in the works
  • 48. Spark 1.2 (Coming in Dec) New machine learning pipelines API > Featurization & parameter search, similar to SciKit-Learn Python API for Spark Streaming Spark SQL pluggable data sources > Hive, JSON, Parquet, Cassandra, ORC, … Scala 2.11 support
  • 49. Beyond Hadoop Batch Interactive Streaming Hadoop Cassandra Mesos … … Public Clouds Your application here Unified API across workloads, storage systems and environments
  • 50. Learn More Downloads and tutorials: spark.apache.org Training: databricks.com/training (free videos) Databricks Cloud: databricks.com/cloud