© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Best Practices for Using
Apache Spark on AWS
Jonathan Fritz, Amazon EMR
Senior Product Manager
July 13, 2016
Agenda
• Why Apache Spark?
• Deploying Spark with Amazon EMR
• EMRFS and connectivity to AWS data stores
• Spark on YARN and DataFrames
• Spark security overview
Spark moves at interactive speed
[Diagram: DAG of stages (map, filter, join, groupBy) across RDDs, with cached partitions marked]
• Massively parallel
• Uses DAGs instead of map-reduce for execution
• Minimizes I/O by storing data in DataFrames in memory
• Partitioning-aware to avoid network-intensive shuffle
Spark components to match your use case
Spark speaks your language
Use DataFrames to easily interact with data
• Distributed collection of data organized in columns
• Abstraction for selecting, filtering, aggregating, and plotting structured data
• More optimized for query execution than RDDs, thanks to the Catalyst query planner
• Datasets introduced in Spark 1.6
Datasets and DataFrames
• Datasets are an extension of the DataFrames API
(preview in Spark 1.6)
• Object-oriented operations (similar to RDD API)
• Utilizes Catalyst query planner
• Optimized encoders which increase performance and
minimize serialization/deserialization overhead
• Compile-time type safety for more robust applications
DataFrames/Datasets/RDDs and fault tolerance
• Track the transformations used to build them (their lineage) to recompute lost data
• E.g.:
messages = textFile(...).filter(lambda s: s.contains("ERROR"))
                        .map(lambda s: s.split('\t')[2])
Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = contains(...)) → MappedRDD (func = split(…))
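As a hedged, runnable PySpark sketch of the same idea (the input path and field index are illustrative assumptions, not from the deck):

from pyspark import SparkContext

sc = SparkContext(appName="lineage-example")

# Each transformation records how to rebuild its output from its parent,
# so a lost partition can be recomputed from lineage instead of replicated.
logs = sc.textFile("hdfs:///var/log/app/")         # HadoopRDD
errors = logs.filter(lambda s: "ERROR" in s)        # FilteredRDD
fields = errors.map(lambda s: s.split("\t")[2])     # MappedRDD

print(fields.toDebugString())   # prints the lineage graph Spark tracked
print(fields.count())           # triggers execution; failures recompute from lineage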
Easily create DataFrames from many formats, or from an existing RDD
• Additional libraries for Spark SQL Data Sources at spark-packages.org
Load data with the Spark SQL Data Sources API
• Sources include Amazon Redshift; additional libraries at spark-packages.org
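A hedged sketch of creating DataFrames from a few sources with the Spark 1.6-era SQLContext API; paths, the JDBC endpoint, and credentials are placeholders:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="datasources-example")
sqlContext = SQLContext(sc)

# Built-in sources
json_df = sqlContext.read.json("s3://my-bucket/raw/events/")
parquet_df = sqlContext.read.parquet("s3://my-bucket/warehouse/events/")

# Spark SQL Data Sources API (format + options), e.g. a JDBC source
jdbc_df = (sqlContext.read.format("jdbc")
    .options(url="jdbc:mysql://mydb.example.com:3306/sales",
             dbtable="orders", user="reader", password="secret")
    .load())

# Or convert an existing RDD into a DataFrame
rdd = sc.parallelize([(1, "a"), (2, "b")])
rdd_df = sqlContext.createDataFrame(rdd, ["id", "value"])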
Use DataFrames for machine learning
• Spark ML libraries
(replacing MLlib) use
DataFrame API as
input/output for models
instead of RDDs
• Create ML pipelines with
a variety of distributed
algorithms
• Pipeline persistence to
save and load models
and full pipelines to
Amazon S3
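A hedged sketch of a DataFrame-based spark.ml pipeline; the column names and S3 paths are illustrative, and (per the Spark 2.0 slides later in this deck) saving pipelines from Python may require Spark 2.0:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

sc = SparkContext(appName="ml-pipeline-example")
sqlContext = SQLContext(sc)

# Training data as a DataFrame with "text" and "label" columns (assumed schema)
training = sqlContext.read.parquet("s3://my-bucket/training/")

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)            # DataFrame in, fitted PipelineModel out

predictions = model.transform(training)   # adds a "prediction" column
model.save("s3://my-bucket/models/text-classifier")   # pipeline persistence to S3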
Create DataFrames on streaming data
• Access data in Spark Streaming DStream
• Create SQLContext on the SparkContext used for Spark Streaming
application for ad hoc queries
• Incorporate DataFrame in Spark Streaming application
• Checkpoint streaming jobs for disaster recovery
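A rough sketch of these bullets with the DStream API; the socket source stands in for a real stream (for example, Amazon Kinesis), and names are illustrative:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-df-example")
ssc = StreamingContext(sc, 10)                   # 10-second batches
ssc.checkpoint("s3://my-bucket/checkpoints/")    # checkpoint for recovery

sqlContext = SQLContext(sc)                      # SQLContext on the same SparkContext
lines = ssc.socketTextStream("localhost", 9999)

def process(time, rdd):
    if rdd.isEmpty():
        return
    # Turn each micro-batch into a DataFrame for ad hoc SQL
    df = sqlContext.createDataFrame(rdd.map(lambda l: Row(line=l)))
    df.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) AS n FROM events").show()

lines.foreachRDD(process)
ssc.start()
ssc.awaitTermination()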
Use R to interact with DataFrames
• SparkR package for using R to manipulate DataFrames
• Create SparkR applications or interactively use the SparkR
shell (SparkR support in Zeppelin 0.6.0—coming soon in EMR)
• Comparable performance to Python and Scala DataFrames
Spark SQL
• Seamlessly mix SQL with Spark programs
• Uniform data access
• Can interact with tables in Hive metastore
• Hive compatibility—run Hive queries without modifications
using HiveContext
• Connect through JDBC/ODBC using the Spark Thrift server
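As a brief hedged sketch (the table and column names are placeholders for tables in your Hive metastore):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="spark-sql-example")
hiveContext = HiveContext(sc)    # uses the Hive metastore configured for the cluster

# Hive-compatible SQL returns a DataFrame...
top_pages = hiveContext.sql(
    "SELECT page, COUNT(*) AS views FROM weblogs "
    "WHERE dt = '2016-07-13' GROUP BY page ORDER BY views DESC LIMIT 10")

# ...which mixes seamlessly with DataFrame operations in the same program
top_pages.filter(top_pages.views > 100).show()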
Coming soon on Amazon EMR: Spark 2.0
Spark 2.0—API updates
• Increased SQL support with new ANSI SQL parser
• Unifying the DataFrame and Dataset API
• Combining SparkContext and HiveContext
• Additional distributed algorithms in SparkR, including K-Means, Generalized Linear Models, and Naive Bayes
• ML pipeline persistence is now supported across all
languages
Spark 2.0—Performance enhancements
• Second generation Tungsten engine
• Whole-stage code generation to create optimized
bytecode at run time
• Improvements to Catalyst optimizer for query
performance
• New vectorized Parquet decoder to increase throughput
Spark 2.0—Structured streaming
• New approach to stream data processing
• Structured Streaming API is an extension to the
DataFrame/Dataset API (no more DStream)
• Better unifies processing on static and streaming datasets
• Leverages Spark to use incremental execution if data is
a dynamic stream, and begins to abstract the nature of
the data from the developer
Creating Spark clusters
with Amazon EMR
Focus on deriving insights from your data
instead of manually configuring clusters
• Easy to install and configure Spark
• Secured
• Spark submit, Apache Oozie, or use the Zeppelin UI
• Quickly add and remove capacity
• Hourly, reserved, or Amazon EC2 Spot pricing
• Use S3 to decouple compute and storage
Create a fully configured cluster with the latest version of Spark in minutes
• AWS Management Console
• AWS Command Line Interface (AWS CLI)
• Or use an AWS SDK directly with the Amazon EMR API
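For example, a hedged boto3 (AWS SDK for Python) sketch of launching a Spark cluster; the release label, instance types, key pair, and log bucket are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-cluster",
    ReleaseLabel="emr-4.7.2",               # an EMR release that includes Spark
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])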
Choice of multiple instances
• CPU optimized (c3, c4 families): machine learning
• Memory optimized (m2, r3 families): caching large DataFrames
• Disk/IO optimized (d2, i2 families, or add EBS volumes to another instance type): large HDFS
• General purpose (m1, m3, m4 families): batch processing
Or use EC2 Spot Instances to save up to 90% on your compute costs.
Options to submit Spark jobs—off cluster
• Amazon EMR Step API: submit a Spark application to Amazon EMR
• AWS Data Pipeline, or Airflow, Luigi, and other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows
• AWS Lambda: use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
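A hedged sketch of the Step API option using boto3; the cluster ID and S3 paths are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",            # your running cluster
    Steps=[{
        "Name": "my-spark-app",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",     # runs spark-submit on the cluster
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-bucket/apps/my_app.py",
                "s3://my-bucket/input/", "s3://my-bucket/output/",
            ],
        },
    }],
)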
Options to submit Spark jobs—on cluster
• Web UIs: Zeppelin notebooks, R Studio, and more!
• Connect with ODBC/JDBC using the Spark Thrift server (start it with start-thriftserver.sh)
• Use Spark Actions in your Apache Oozie workflow to create DAGs of Spark jobs
• Use the Spark Job Server for a REST interface and shared DataFrames across jobs
• Use the Spark shell on your cluster
Monitoring and Debugging
• Log pushing to S3
• Logs produced by driver and executors on each node
• Can browse through log folders in EMR console
• Spark UI
• Job performance, task breakdown of jobs, information about
cached DataFrames, and more
• Ganglia monitoring
• Amazon CloudWatch metrics in the EMR console
Some of our customers running Spark on EMR
Amazon.com personalization team using Spark + DSSTNE
http://blogs.aws.amazon.com/bigdata/
A quick look at Zeppelin and
the Spark UI
Using Amazon S3 as persistent
storage for Spark
Decouple compute and storage by using S3
as your data layer
[Diagram: several Amazon EMR clusters sharing data through Amazon S3 instead of instance memory, local disk, or HDFS]
• S3 is designed for 11 9's of durability and is massively scalable
• Intermediates are stored on local disk or HDFS
EMR Filesystem (EMRFS)
• S3 connector for EMR (implements the Hadoop
FileSystem interface)
• Improved performance and error handling options
• Transparent to applications—just read/write to “s3://”
• Consistent view feature for consistent listing
• Support for S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
Partitions, compression, and file formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data set to minimize bandwidth from S3 to
EC2
• Make sure you use splittable compression or have each file
be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased
performance on reads
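A hedged sketch of this advice: writing a compressed, partitioned Parquet data set to S3 with the Spark 1.6-era API (the paths, partition column, and file-count choice are illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-example")
sqlContext = SQLContext(sc)

# Snappy-compressed Parquet keeps S3 bandwidth down and remains splittable
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

events = sqlContext.read.json("s3://my-bucket/raw/events/")

(events
    .repartition(64)                    # control the number/size of output files
    .write
    .partitionBy("dt")                  # enables automatic partition discovery on read
    .parquet("s3://my-bucket/curated/events/"))

# Readers then scan only the columns and partitions they need
df = sqlContext.read.parquet("s3://my-bucket/curated/events/")
df.select("user_id").where(df.dt == "2016-07-13").count()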
Use Amazon RDS for an external Hive metastore
• Amazon RDS hosts the Hive metastore, with schema for tables in S3
• Set the metastore location in hive-site
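For example, a hedged EMR configuration object for the hive-site classification (same format as the spark-defaults example later in this deck); the RDS endpoint, driver, and credentials are placeholders:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://mydb.example.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "hive",
      "javax.jdo.option.ConnectionPassword": "your-password"
    }
  }
]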
Using Spark with other
data stores in AWS
Many storage layers to choose from
• Amazon DynamoDB: EMR-DynamoDB connector
• Amazon RDS: JDBC data source with Spark SQL
• Amazon Kinesis: streaming data connectors
• Elasticsearch: Elasticsearch connector
• Amazon Redshift: Spark–Amazon Redshift connector
• Amazon S3: EMR File System (EMRFS)
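As one hedged example, reading an Amazon Redshift table through the spark-redshift package from spark-packages.org; the JDBC URL, table, and temporary S3 location are placeholders:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="redshift-example")
sqlContext = SQLContext(sc)

sales = (sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/db?user=awsuser&password=secret")
    .option("dbtable", "sales")
    .option("tempdir", "s3n://my-bucket/redshift-temp/")   # staging area for UNLOAD/COPY
    .load())

sales.groupBy("region").count().show()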
Spark architecture
• Run Spark driver in
client or cluster mode
• Spark application runs
as a YARN application
• SparkContext runs as a
library in your program,
one instance per Spark
application
• Spark Executors run in
YARN Containers on
NodeManagers in your
cluster
Amazon EMR runs Spark on YARN
• Dynamically share and centrally
configure the same pool of cluster
resources across engines
• Schedulers for categorizing, isolating,
and prioritizing workloads
• Choose the number of executors to use,
or allow YARN to choose (dynamic
allocation)
• Kerberos authentication
[Diagram: YARN cluster resource management on top of S3/HDFS storage, running batch (MapReduce) and in-memory (Spark) engines for applications such as Pig, Hive, Cascading, Spark Streaming, and Spark SQL]
YARN schedulers—CapacityScheduler
• Default scheduler specified in Amazon EMR
• Queues
• Single queue is set by default
• Can create additional queues for workloads based on
multitenancy requirements
• Capacity guarantees
• Set minimal resources for each queue
• Programmatically assign free resources to queues
• Adjust these settings using the classification capacity-
scheduler in an EMR configuration object
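For instance, a hedged capacity-scheduler configuration object that splits capacity between two queues; the queue names and percentages are illustrative, and the property names follow the Hadoop CapacityScheduler conventions rather than this deck:

[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.root.queues": "default,analytics",
      "yarn.scheduler.capacity.root.default.capacity": "60",
      "yarn.scheduler.capacity.root.analytics.capacity": "40"
    }
  }
]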
What is a Spark executor?
• Processes that store data and run tasks for your Spark
application
• Specific to a single Spark application (no shared
executors across applications)
• Executors run in YARN containers managed by YARN
NodeManager daemons
Inside Spark executor on YARN
Max container size on node: yarn.nodemanager.resource.memory-mb (classification: yarn-site)
• Controls the maximum sum of memory used by YARN container(s)
• EMR sets this value on each node based on instance type
Executor containers are created on each node within the max container size.
Memory overhead: spark.yarn.executor.memoryOverhead (classification: spark-defaults)
• Off-heap memory (VM overheads, interned strings, etc.)
• Roughly 10% of the container size
Spark executor memory: spark.executor.memory (classification: spark-defaults)
• Amount of memory to use per executor process
• EMR sets this based on the instance family selected for core nodes
• Cannot have different sized executors in the same Spark application
Execution/cache memory: spark.memory.fraction (classification: spark-defaults)
• Programmatically manages memory for execution and storage
• spark.memory.storageFraction sets the fraction of storage immune to eviction
• Before Spark 1.6: manually set spark.shuffle.memoryFraction and spark.storage.memoryFraction
Configuring executors—dynamic allocation
• Optimal resource utilization
• YARN dynamically creates and shuts down executors
based on the resource needs of the Spark application
• Spark uses the executor memory and executor cores
settings in the configuration for each executor
• Amazon EMR uses dynamic allocation by default (emr-
4.5 and later), and calculates the default executor size to
use based on the instance family of your core group
Properties related to dynamic allocation
• spark.dynamicAllocation.enabled = true
• spark.shuffle.service.enabled = true
Optional tuning properties:
• spark.dynamicAllocation.minExecutors = 5
• spark.dynamicAllocation.maxExecutors = 17
• spark.dynamicAllocation.initialExecutors = 0
• spark.dynamicAllocation.executorIdleTimeout = 60s
• spark.dynamicAllocation.schedulerBacklogTimeout = 5s
• spark.dynamicAllocation.sustainedSchedulerBacklogTimeout = 5s
Easily override spark-defaults
[
{
"Classification": "spark-defaults",
"Properties": {
"spark.executor.memory": "15g",
"spark.executor.cores": "4"
}
}
]
EMR console:
Configuration object:
Configuration precedence: (1) SparkConf object, (2) flags passed to Spark Submit, (3) spark-defaults.conf
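At the application level, the highest-precedence option is a SparkConf object; a minimal hedged sketch (the values are illustrative):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
    .setAppName("my-app")
    .set("spark.executor.memory", "15g")
    .set("spark.executor.cores", "4"))

sc = SparkContext(conf=conf)   # overrides spark-defaults for this application only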
When to set executor configuration
• Need to fit larger partitions in memory
• GC is too high (though this is being resolved in Spark
1.5+ through work in Project Tungsten)
• Long-running, single tenant Spark Applications
• Static executors recommended for Spark Streaming
• Could be good for multitenancy, depending on YARN
scheduler being used
More options for executor configuration
• When creating your cluster, specify
maximizeResourceAllocation to create one large
executor per node. Spark will use all of the executors for
each application submitted.
• Adjust the Spark executor settings using an EMR
configuration object when creating your cluster
• Pass in configuration overrides when running your Spark
application with spark-submit
DataFrames
Minimize data being read in the DataFrame
• Use columnar formats like Parquet to scan less data
• More partitions give you more parallelism
• Automatic partition discovery when using Parquet
• Can repartition a DataFrame
• You can also adjust parallelism with spark.default.parallelism
• Cache DataFrames in memory (StorageLevel); see the sketch below
• Small datasets: MEMORY_ONLY
• Larger datasets: MEMORY_AND_DISK
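A short hedged sketch of the caching bullets above; the paths are placeholders:

from pyspark import SparkContext, StorageLevel
from pyspark.sql import SQLContext

sc = SparkContext(appName="caching-example")
sqlContext = SQLContext(sc)

small_df = sqlContext.read.parquet("s3://my-bucket/dimensions/")
small_df.persist(StorageLevel.MEMORY_ONLY)        # small data set: keep it all in memory

large_df = sqlContext.read.parquet("s3://my-bucket/facts/")
large_df.persist(StorageLevel.MEMORY_AND_DISK)    # larger data set: spill what does not fit

large_df.count()   # an action materializes the cache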
For DataFrames: data serialization
• Data is serialized when cached or shuffled
• Default: Java serializer
• Kryo serialization (10x faster than Java serialization)
• Does not support all Serializable types
• Register the class in advance
Usage: set in SparkConf
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Spark security on
Amazon EMR
VPC private subnets to isolate network
• Use Amazon S3 endpoints for
connectivity to S3
• Use Managed NAT for connectivity to
other services or the Internet
• Control the traffic using security groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess
Spark on EMR security overview
Encryption At-Rest
• HDFS transparent encryption (AES 256)
• Local disk encryption for temporary files using LUKS encryption
• EMRFS support for S3 client-side and server-side encryption
Encryption In-Flight
• Secure communication with SSL from S3 to EC2 (nodes of cluster)
• HDFS blocks encrypted in-transit when using HDFS encryption
• SASL encryption (digest-MD5) for Spark Shuffle
Permissions
• AWS Identity and Access Management (IAM) roles, Kerberos, and IAM users
Access
• VPC private subnet support, security groups, and SSH keys
Auditing
• AWS CloudTrail and S3 object-level auditing
Thank you!
Jonathan Fritz - jonfritz@amazon.com
https://aws.amazon.com/emr/spark
http://blogs.aws.amazon.com/bigdata

More Related Content

What's hot

Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystifiedOmid Vahdaty
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 

What's hot (20)

Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 

Viewers also liked

AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAmazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5RocksAmazon Web Services Korea
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
 
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입Hoon Park
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기Brian Hong
 
Ad-Tech on AWS 세미나 | AWS와 데이터 분석
Ad-Tech on AWS 세미나 | AWS와 데이터 분석Ad-Tech on AWS 세미나 | AWS와 데이터 분석
Ad-Tech on AWS 세미나 | AWS와 데이터 분석Amazon Web Services Korea
 

Viewers also liked (9)

AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기
 
Ad-Tech on AWS 세미나 | AWS와 데이터 분석
Ad-Tech on AWS 세미나 | AWS와 데이터 분석Ad-Tech on AWS 세미나 | AWS와 데이터 분석
Ad-Tech on AWS 세미나 | AWS와 데이터 분석
 

Similar to Best Practices for Using Apache Spark on AWS

Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAmazon Web Services
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesAmazon Web Services
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts Julien SIMON
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkAmazon Web Services
 
Databases & Analytics AWS re:invent 2019 Recap
Databases & Analytics AWS re:invent 2019 RecapDatabases & Analytics AWS re:invent 2019 Recap
Databases & Analytics AWS re:invent 2019 RecapSungmin Kim
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...Amazon Web Services
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best PracticesAmazon Web Services
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Amazon Web Services
 

Similar to Best Practices for Using Apache Spark on AWS (20)

Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
 
Databases & Analytics AWS re:invent 2019 Recap
Databases & Analytics AWS re:invent 2019 RecapDatabases & Analytics AWS re:invent 2019 Recap
Databases & Analytics AWS re:invent 2019 Recap
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Recently uploaded (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Best Practices for Using Apache Spark on AWS

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practices for Using Apache Spark on AWS Jonathan Fritz, Amazon EMR Senior Product Manager July 13, 2016
  • 2. Agenda • Why Apache Spark? • Deploying Spark with Amazon EMR • EMRFS and connectivity to AWS data stores • Spark on YARN and DataFrames • Spark security overview
  • 3.
  • 4. Spark moves at interactive speed join filter groupBy Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: = cached partition= RDD map • Massively parallel • Uses DAGs instead of map- reduce for execution • Minimizes I/O by storing data in DataFrames in memory • Partitioning-aware to avoid network-intensive shuffle
  • 5. Spark components to match your use case
  • 6. Spark speaks your language
  • 7. Use DataFrames to easily interact with data • Distributed collection of data organized in columns • Abstraction for selecting, filtering, aggregating, and plotting structured data • More optimized for query execution than RDDs from Catalyst query planner • Datasets introduced in Spark 1.6
  • 8. Datasets and DataFrames • Datasets are an extension of the DataFrames API (preview in Spark 1.6) • Object-oriented operations (similar to RDD API) • Utilizes Catalyst query planner • Optimized encoders which increase performance and minimize serialization/deserialization overhead • Compile-time type safety for more robust applications
  • 9. Track the transformations used to build them (their lineage) to recompute lost data E.g.: DataFrames/Datasets/RDDs and fault tolerance messages = textFile(...).filter(lambda s: s.contains(“ERROR”)) .map(lambda s: s.split(‘t’)[2]) HadoopRDD path = hdfs://… FilteredRDD func = contains(...) MappedRDD func = split(…)
  • 10. Easily create DataFrames from many formats RDD Additional libraries for Spark SQL Data Sources at spark-packages.org
  • 11. Load data with the Spark SQL Data Sources API Amazon Redshift Additional libraries at spark-packages.org
  • 12. Use DataFrames for machine learning • Spark ML libraries (replacing MLlib) use DataFrame API as input/output for models instead of RDDs • Create ML pipelines with a variety of distributed algorithms • Pipeline persistence to save and load models and full pipelines to Amazon S3
  • 13. Create DataFrames on streaming data • Access data in Spark Streaming DStream • Create SQLContext on the SparkContext used for Spark Streaming application for ad hoc queries • Incorporate DataFrame in Spark Streaming application • Checkpoint streaming jobs for disaster recovery
  • 14. Use R to interact with DataFrames • SparkR package for using R to manipulate DataFrames • Create SparkR applications or interactively use the SparkR shell (SparkR support in Zeppelin 0.6.0—coming soon in EMR) • Comparable performance to Python and Scala DataFrames
  • 15. Spark SQL • Seamlessly mix SQL with Spark programs • Uniform data access • Can interact with tables in Hive metastore • Hive compatibility—run Hive queries without modifications using HiveContext • Connect through JDBC/ODBC using the Spark Thrift server
  • 16. Coming soon on Amazon EMR 2.0
  • 17. Spark 2.0—API updates • Increased SQL support with new ANSI SQL parser • Unifying the DataFrame and Dataset API • Combining SparkContext and HiveContext • Additional distributed algorithms in SparkR, including K- Means, Generalized Linear Models, and Naive Bayes • ML pipeline persistence is now supported across all languages
  • 18. Spark 2.0—Performance enhancements • Second generation Tungsten engine • Whole-stage code generation to create optimized bytecode at run time • Improvements to Catalyst optimizer for query performance • New vectorized Parquet decoder to increase throughput
  • 19. Spark 2.0—Structured streaming • New approach to stream data processing • Structured Streaming API is an extension to the DataFrame/Dataset API (no more DStream) • Better unifies processing of static and streaming datasets • Leverages Spark to use incremental execution when the data is a dynamic stream, and begins to abstract the nature of the data from the developer
  • 21. Focus on deriving insights from your data instead of manually configuring clusters • Easy to install and configure Spark • Secured • Submit jobs with Spark submit, Apache Oozie, or the Zeppelin UI • Quickly add and remove capacity • Hourly, reserved, or Amazon EC2 Spot pricing • Use S3 to decouple compute and storage
  • 22. Create a fully configured cluster with the latest version of Spark in minutes • AWS Management Console • AWS Command Line Interface (AWS CLI) • Or use an AWS SDK directly with the Amazon EMR API
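A hedged boto3 sketch of the SDK route (release label, instance types, and counts are illustrative; the default EMR IAM roles must already exist in the account):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-cluster",
        ReleaseLabel="emr-4.7.2",                 # any Spark-enabled release label
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        VisibleToAllUsers=True,
    )
    print("Cluster ID:", response["JobFlowId"])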
  • 23. Choice of multiple instance families • CPU: c3 and c4 families (e.g. machine learning) • Memory: m2 and r3 families (e.g. caching large DataFrames) • Disk/IO: d2 and i2 families, or just add EBS to another instance type (e.g. large HDFS) • General: m1, m3, and m4 families (e.g. batch processing) • Or use EC2 Spot Instances to save up to 90% on your compute costs.
  • 24. Options to submit Spark Jobs—off cluster Amazon EMR Step API Submit a Spark application Amazon EMR AWS Data Pipeline Airflow, Luigi, or other schedulers on EC2 Create a pipeline to schedule job submission or create complex workflows AWS Lambda Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster
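A hedged sketch of the Step API from Python; the same call works from an AWS Lambda function or a scheduler. The cluster ID, script, and S3 paths are placeholders:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    step = {
        "Name": "nightly-spark-app",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",          # runs spark-submit on the master node
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/my_app.py",
                "s3://my-bucket/input/", "s3://my-bucket/output/",
            ],
        },
    }
    emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])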
  • 25. Options to submit Spark jobs—on cluster • Web UIs: Zeppelin notebooks, R Studio, and more! • Connect with ODBC / JDBC using the Spark Thrift server (start it with start-thriftserver.sh) • Use Spark Actions in your Apache Oozie workflow to create DAGs of Spark jobs • Other: use the Spark Job Server for a REST interface and shared DataFrames across jobs, or use the Spark shell on your cluster
  • 26. Monitoring and Debugging • Log pushing to S3 • Logs produced by driver and executors on each node • Can browse through log folders in EMR console • Spark UI • Job performance, task breakdown of jobs, information about cached DataFrames, and more • Ganglia monitoring • Amazon CloudWatch metrics in the EMR console
  • 27. Some of our customers running Spark on EMR
  • 28.
  • 29.
  • 30. Amazon.com personalization team using Spark + DSSTNE http://blogs.aws.amazon.com/bigdata/
  • 31. A quick look at Zeppelin and the Spark UI
  • 32. Using Amazon S3 as persistent storage for Spark
  • 33. Decouple compute and storage by using S3 as your data layer • S3 is designed for 11 9’s of durability and is massively scalable • Intermediates stored in EC2 instance memory, on local disk, or in HDFS [Diagram: multiple Amazon EMR clusters reading from and writing to a shared Amazon S3 data layer]
  • 34. EMR Filesystem (EMRFS) • S3 connector for EMR (implements the Hadoop FileSystem interface) • Improved performance and error handling options • Transparent to applications—just read/write to “s3://” • Consistent view feature set for consistent list • Support for S3 server-side and client-side encryption • Faster listing using EMRFS metadata
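Because EMRFS implements the Hadoop FileSystem interface, Spark code needs no S3-specific handling; a minimal sketch with hypothetical paths and columns:

    # Read from and write to S3 exactly like any other Hadoop path
    logs = sqlContext.read.json("s3://my-bucket/raw-logs/")
    errors = logs.filter(logs.level == "ERROR")
    errors.write.mode("overwrite").parquet("s3://my-bucket/curated/errors/")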
  • 35. Partitions, compression, and file formats • Avoid key names in lexicographical order • Improve throughput and S3 list performance • Use hashing/random prefixes or reverse the date-time • Compress data set to minimize bandwidth from S3 to EC2 • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased performance on reads
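A sketch combining these points: write a splittable, compressed, columnar copy of a hypothetical events DataFrame partitioned by date, then read it back with partition pruning:

    # Snappy-compressed Parquet is splittable and columnar
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

    (events.repartition(200)              # size partitions for the cluster
           .write
           .partitionBy("dt")             # partition directories enable pruning
           .parquet("s3://my-bucket/events-parquet/"))

    # Only the matching dt=2016-07-13 prefix is listed and read from S3
    one_day = (sqlContext.read.parquet("s3://my-bucket/events-parquet/")
                         .filter("dt = '2016-07-13'"))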
  • 36. Use Amazon RDS for an external Hive metastore • The RDS-hosted Hive metastore holds the schema for tables stored in Amazon S3 • Set the metastore location in hive-site
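A hedged sketch of the hive-site classification expressed as a Python structure (pass it as the Configurations argument of run_job_flow, or paste the equivalent JSON in the console); the RDS endpoint and credentials are placeholders:

    hive_metastore_config = [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL":
                    "jdbc:mysql://hive-metastore.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true",
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": "hive",
                "javax.jdo.option.ConnectionPassword": "<password>",
            },
        }
    ]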
  • 37. Using Spark with other data stores in AWS
  • 38. Many storage layers to choose from • Amazon DynamoDB (EMR-DynamoDB connector) • Amazon RDS (JDBC data source w/ Spark SQL) • Amazon Kinesis (streaming data connectors) • Elasticsearch (Elasticsearch connector) • Amazon Redshift (Spark–Amazon Redshift connector) • Amazon S3 (EMR File System, EMRFS)
  • 40. • Run Spark driver in client or cluster mode • Spark application runs as a YARN application • SparkContext runs as a library in your program, one instance per Spark application • Spark Executors run in YARN Containers on NodeManagers in your cluster
  • 41. Amazon EMR runs Spark on YARN • Dynamically share and centrally configure the same pool of cluster resources across engines • Schedulers for categorizing, isolating, and prioritizing workloads • Choose the number of executors to use, or allow YARN to choose (dynamic allocation) • Kerberos authentication [Stack diagram: applications (Pig, Hive, Cascading, Spark Streaming, Spark SQL) running on batch (MapReduce) and in-memory (Spark) engines, managed by YARN cluster resource management, over S3 and HDFS storage]
  • 42. YARN schedulers—CapacityScheduler • Default scheduler specified in Amazon EMR • Queues • Single queue is set by default • Can create additional queues for workloads based on multitenancy requirements • Capacity guarantees • Set minimal resources for each queue • Programmatically assign free resources to queues • Adjust these settings using the classification capacity-scheduler in an EMR configuration object
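For instance, a hedged sketch of such a configuration object that splits capacity between a default queue and a second queue (queue names and percentages are illustrative):

    capacity_scheduler_config = [
        {
            "Classification": "capacity-scheduler",
            "Properties": {
                "yarn.scheduler.capacity.root.queues": "default,analytics",
                "yarn.scheduler.capacity.root.default.capacity": "60",
                "yarn.scheduler.capacity.root.analytics.capacity": "40",
            },
        }
    ]
    # Supply this list as the Configurations argument when creating the cluster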
  • 43. What is a Spark executor? • Processes that store data and run tasks for your Spark application • Specific to a single Spark application (no shared executors across applications) • Executors run in YARN containers managed by YARN NodeManager daemons
  • 44. Inside Spark executor on YARN • Max container size on node: yarn.nodemanager.resource.memory-mb (classification: yarn-site) • Controls the maximum sum of memory used by YARN container(s) • EMR sets this value on each node based on instance type
  • 45. Inside Spark executor on YARN • Executor containers are created on each node, within the maximum container size
  • 46. Inside Spark executor on YARN • Memory overhead: spark.yarn.executor.memoryOverhead (classification: spark-defaults) • Off-heap memory (VM overheads, interned strings, etc.) • Roughly 10% of container size
  • 47. Inside Spark executor on YARN • Spark executor memory: spark.executor.memory (classification: spark-defaults) • Amount of memory to use per executor process • EMR sets this based on the instance family selected for core nodes • Cannot have different sized executors in the same Spark application
  • 48. Inside Spark executor on YARN • Execution/cache memory: spark.memory.fraction (classification: spark-defaults) • Programmatically manages memory for execution and storage • spark.memory.storageFraction sets the fraction of storage immune to eviction • Before Spark 1.6: manually set spark.shuffle.memoryFraction and spark.storage.memoryFraction
  • 49. Configuring executors—dynamic allocation • Optimal resource utilization • YARN dynamically creates and shuts down executors based on the resource needs of the Spark application • Spark uses the executor memory and executor cores settings in the configuration for each executor • Amazon EMR uses dynamic allocation by default (emr- 4.5 and later), and calculates the default executor size to use based on the instance family of your core group
  • 50. Properties related to dynamic allocation • spark.dynamicAllocation.enabled = true • spark.shuffle.service.enabled = true • Optional: spark.dynamicAllocation.minExecutors = 5, spark.dynamicAllocation.maxExecutors = 17, spark.dynamicAllocation.initialExecutors = 0, spark.dynamicAllocation.executorIdleTimeout = 60s, spark.dynamicAllocation.schedulerBacklogTimeout = 5s, spark.dynamicAllocation.sustainedSchedulerBacklogTimeout = 5s
  • 51. Easily override spark-defaults with a configuration object (EMR console or API): [ { "Classification": "spark-defaults", "Properties": { "spark.executor.memory": "15g", "spark.executor.cores": "4" } } ] • Configuration precedence: (1) SparkConf object, (2) flags passed to spark-submit, (3) spark-defaults.conf
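A short sketch of the highest-precedence option, setting the same values on a SparkConf object inside the application:

    from pyspark import SparkConf, SparkContext

    # Settings on SparkConf override spark-submit flags and spark-defaults.conf
    conf = (SparkConf()
            .setAppName("memory-heavy-job")
            .set("spark.executor.memory", "15g")
            .set("spark.executor.cores", "4"))
    sc = SparkContext(conf=conf)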
  • 52. When to set executor configuration • Need to fit larger partitions in memory • GC is too high (though this is being resolved in Spark 1.5+ through work in Project Tungsten) • Long-running, single tenant Spark Applications • Static executors recommended for Spark Streaming • Could be good for multitenancy, depending on YARN scheduler being used
  • 53. More options for executor configuration • When creating your cluster, specify maximizeResourceAllocation to create one large executor per node. Spark will use all of the executors for each application submitted. • Adjust the Spark executor settings using an EMR configuration object when creating your cluster • Pass in configuration overrides when running your Spark application with spark-submit
  • 55. Minimize data being read in the DataFrame • Use columnar formats like Parquet to scan less data • More partitions give you more parallelism • Automatic partition discovery when using Parquet • Can repartition a DataFrame • You can also adjust parallelism with spark.default.parallelism • Cache DataFrames in memory (StorageLevel) • Small datasets: MEMORY_ONLY • Larger datasets: MEMORY_AND_DISK
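A minimal caching sketch with a hypothetical DataFrame; the storage level chosen decides whether partitions that do not fit in memory are dropped or spilled to local disk:

    from pyspark import StorageLevel

    pageviews = sqlContext.read.parquet("s3://my-bucket/pageviews/")

    # Small data set: keep deserialized partitions in memory only
    pageviews.persist(StorageLevel.MEMORY_ONLY)

    # Larger data set: spill partitions that do not fit to local disk instead
    # pageviews.persist(StorageLevel.MEMORY_AND_DISK)

    pageviews.count()   # an action materializes the cache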
  • 56. For DataFrames: data serialization • Data is serialized when cached or shuffled • Default: Java serializer • Kryo serialization (10x faster than Java serialization) • Does not support all Serializable types • Register the class in advance • Usage: set in SparkConf with conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  • 58. VPC private subnets to isolate network • Use Amazon S3 endpoints for connectivity to S3 • Use Managed NAT for connectivity to other services or the Internet • Control the traffic using security groups • ElasticMapReduce-Master-Private • ElasticMapReduce-Slave-Private • ElasticMapReduce-ServiceAccess
  • 59. Spark on EMR security overview • Encryption at rest: HDFS transparent encryption (AES-256), local disk encryption for temporary files using LUKS encryption, EMRFS support for S3 client-side and server-side encryption • Encryption in flight: secure communication with SSL from S3 to EC2 (nodes of cluster), HDFS blocks encrypted in transit when using HDFS encryption, SASL encryption (DIGEST-MD5) for Spark shuffle • Permissions: AWS Identity and Access Management (IAM) roles, Kerberos, and IAM users • Access: VPC private subnet support, security groups, and SSH keys • Auditing: AWS CloudTrail and S3 object-level auditing
  • 60. Thank you! Jonathan Fritz - jonfritz@amazon.com https://aws.amazon.com/emr/spark http://blogs.aws.amazon.com/bigdata