Abhishek Sinha – Sr. Product Manager
sinhaar@amazon.com
@abysinha
Amazon EMR
Making it easy, secure and cost-effective to run
data processing frameworks on the AWS cloud
Amazon EMR
• Managed platform
• Hadoop MapReduce, Spark, Presto,
and more
• Launch clusters in minutes
• Apache Bigtop based distribution
• Leverage the elasticity of the cloud
• Added security features
• Pay by the hour and save with Spot
• Flexibility to customize
• Programmable Infrastructure
What do I need to build a cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
Cluster composition
Master Node: NameNode (HDFS), ResourceManager (YARN), and other components
Core Instance Group: HDFS DataNode, YARN NodeManager
Task Instance Groups: YARN NodeManager
Choice of multiple instances
CPU: c3, c4 families
Memory: m2, r3 families
Disk/IO: d2, i2 families
General: m1, m3, m4 families
Typical workloads: machine learning, batch processing, in-memory (Spark & Presto), large HDFS
Or add EBS volumes if you need additional on-cluster storage.
Hadoop applications available in EMR
Or, use bootstrap actions to install arbitrary
applications on your cluster!
Choose your software - Quick Create
Choose your software – Advanced Options
Configuration API for custom configs
[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "0.90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]
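A minimal sketch of building a configuration list like the one above in Python, for example to write it to a file for the CLI's --configurations option. The helper name `classification` is hypothetical, and only two of the properties from the slide are reproduced.

```python
import json


def classification(name, properties):
    """Build one entry of an EMR configuration list (hypothetical helper)."""
    # Property values are passed as strings, so coerce them here.
    return {
        "Classification": name,
        "Properties": {key: str(value) for key, value in properties.items()},
    }


configurations = [
    classification("core-site", {"hadoop.security.groups.cache.secs": 250}),
    classification("mapred-site", {"mapred.tasktracker.map.tasks.maximum": 2}),
]

# Serialize the list, e.g. to save as a file referenced by --configurations.
config_json = json.dumps(configurations, indent=2)
print(config_json)
```

Keeping the list as plain data makes it easy to review and version alongside the rest of the cluster definition.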
Use the AWS CLI to easily create clusters:
aws emr create-cluster \
  --release-label emr-4.3.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK for programmatic provisioning:
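The SDK example from this slide did not survive extraction. As a rough sketch of what programmatic provisioning could look like with the AWS SDK for Python (boto3), the code below builds the keyword arguments for the EMR `run_job_flow` call. The cluster name, region, and role names are assumptions: `EMR_DefaultRole` and `EMR_EC2_DefaultRole` are the defaults created by `aws emr create-default-roles`. The request is only submitted if `launch_cluster` is called with boto3 installed and credentials configured.

```python
def build_cluster_request(name="example-cluster"):
    """Return keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,  # hypothetical cluster name
        "ReleaseLabel": "emr-4.3.0",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge",
                 "InstanceCount": 2},
            ],
            # Assumption: a transient cluster that ends when its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Assumption: the default EMR roles exist in the account.
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }


def launch_cluster(region="us-east-1"):
    """Submit the request; needs boto3 and AWS credentials (not run here)."""
    import boto3  # imported lazily so the request can be inspected offline

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_request())
    return response["JobFlowId"]


request = build_cluster_request()
print(request["ReleaseLabel"], len(request["Instances"]["InstanceGroups"]))
```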
Use Amazon EMR to
separate your
compute and storage.
On premises: compute and storage grow together
Tightly coupled
Storage grows along with
compute
Compute requirements vary
On premises: Underutilized or scarce resources
[Chart: utilization over time, annotated with re-processing, weekly peaks, and steady state]
On premises: Contention for same resources
Compute
bound
Memory
bound
Separation of resources creates data silos
Team A
On premises: Replication adds to cost
3x
HDFS needs 3x
Multi-Data Center DR
Use Amazon EMR to
separate your
compute and storage.
EMR can process data from many sources
• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB, Redshift, Aurora, RDS
• Amazon Kinesis
• Other applications running in your architecture (Kafka, Elasticsearch, etc.)
Amazon S3 is your persistent data store
11 9’s of durability
$0.03 / GB / Month in US-East
Life Cycle Policies
Available across AZs
Easy access
Amazon S3
The EMR Filesystem (EMRFS)
• Allows you to leverage S3 as a file-system for Hadoop
• Streams data directly from S3
• Cluster still uses local disk/HDFS for intermediates
• Better read/write performance and error handling than open source components
• Optional consistent view for consistent listing
• Support for encryption
• Fast listing of objects
Going from HDFS to S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'
Going from HDFS to S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
Benefit 1: Switch off clusters
Amazon S3
Auto-terminate clusters after job completion
You can build a pipeline
Submit jobs using:
- EMR Step API
- Oozie
- SSH directly
- Genie (Gateway)
- OSS workflow tools (e.g., Luigi)
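As an illustration of the EMR Step API option above, the sketch below builds a step definition in the shape that boto3's `add_job_flow_steps` call accepts. The step name, S3 path, and arguments are hypothetical, and `command-runner.jar` assumes an EMR 4.x or later release.

```python
def spark_step(name, app_s3_path, app_args=()):
    """Build one step definition that runs spark-submit on the cluster."""
    return {
        "Name": name,
        # Assumption: shut the cluster down if this step fails.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     app_s3_path, *app_args],
        },
    }


# Hypothetical application location and arguments.
step = spark_step("nightly-etl", "s3://my-bucket/jobs/etl.py",
                  ["--date", "2016-04-01"])
print(step["HadoopJarStep"]["Args"])
```

A list of such steps can be passed as `Steps=[step]` either when the cluster is created or later via `add_job_flow_steps`, which is one way to build the pipeline described on the slide.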
You can use AWS Data Pipeline
Input data
Use EMR to transform
unstructured to
structured data
Push to S3
Ingest into
Redshift
Run transient or long-running clusters
Benefit 2: Resize your cluster to match
workload requirements
Resize using the Console, CLI, or API
Save costs with EC2 Spot instances
[Chart: Spot bid price vs. On-Demand (OD) price]
Spot integration
aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
The Spot Bid Advisor
Spot Integration with EMR
• Can provision instances from the Spot Market
• Replaces a spot instance in case of interruption
• Impact of interruption
• Master Node – Can lose the cluster
• Core Node – Can lose data stored in HDFS
• Task Nodes – lose the task (but the task will run elsewhere)
Scale up with Spot Instances
10-node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140
Resize Nodes with Spot Instances
Add 10 more nodes on Spot
20-node cluster running for 7 hours
On-Demand cost = 1.0 * 10 * 7 = $70
Spot cost = 0.5 * 10 * 7 = $35
Total = $105
50% less run-time (14 → 7 hours)
25% less cost ($140 → $105)
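The arithmetic on these slides can be checked with a short calculation. The rates ($1.00 per node-hour On-Demand, $0.50 per node-hour Spot) are the ones implied by the slide, and the sketch assumes, as the slide does, that doubling the node count halves the run-time.

```python
ON_DEMAND_RATE = 1.0  # dollars per node-hour, as on the slide
SPOT_RATE = 0.5       # dollars per node-hour, as on the slide


def cluster_cost(on_demand_nodes, spot_nodes, hours):
    """Total cost of a cluster with a mix of On-Demand and Spot nodes."""
    hourly = on_demand_nodes * ON_DEMAND_RATE + spot_nodes * SPOT_RATE
    return hourly * hours


baseline = cluster_cost(10, 0, 14)    # 10 On-Demand nodes for 14 hours
resized = cluster_cost(10, 10, 7)     # plus 10 Spot nodes, for 7 hours

runtime_saving = 1 - 7 / 14
cost_saving = 1 - resized / baseline

print(f"baseline ${baseline:.0f}, resized ${resized:.0f}")
print(f"run-time saving {runtime_saving:.0%}, cost saving {cost_saving:.0%}")
```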
Intelligent scale down
Intelligent scale down – HDFS
Effectively utilize clusters
Benefit 3: Logical separation of jobs
Hive, Pig,
Cascading
Prod
Presto Ad-Hoc
Amazon S3
Benefit 4 : Disaster recovery built-in
Cluster 1 Cluster 2
Cluster 3 Cluster 4
Amazon S3
Availability Zone Availability Zone
Hive Metastore in
Amazon RDS
S3 as a data-lake
Nate Sammons, Principal Architect - NASDAQ
Monitoring with CloudWatch (or Ganglia)
EMR logging to S3 makes logs easily available
Spark moves at interactive speed
[Diagram: Spark job DAG with RDDs A through F, connected by map, filter, groupBy, and join operations and divided into Stage 1, Stage 2, and Stage 3; legend: RDD, cached partition]
• Massively parallel
• Uses DAGs instead of MapReduce for execution
• Minimizes I/O by storing data in RDDs in memory
• Partitioning-aware to avoid network-intensive shuffle
Spark components to match your use case
Spark speaks your language
Use DataFrames to easily interact with data
• Distributed
collection of data
organized in
columns
• An extension of the
existing RDD API
• Optimized for query
execution
Easily create DataFrames from many formats
RDD
Additional libraries for Spark SQL Data Sources
at spark-packages.org
Load data with the Spark SQL Data Sources API
Additional libraries at spark-packages.org
Sample DataFrame manipulations
Use DataFrames for machine learning
• Spark ML libraries
(replacing MLlib) use
DataFrames as
input/output for
models
• Create ML pipelines
with a variety of
distributed algorithms
Create DataFrames on streaming data
• Access data in Spark Streaming DStream
• Create SQLContext on the SparkContext used for Spark
Streaming application for ad hoc queries
• Incorporate DataFrame in Spark Streaming application
• Checkpointing streaming jobs
Spark Pipeline
Use R to interact with DataFrames
• SparkR package for using R to manipulate DataFrames
• Create SparkR applications or interactively use the SparkR
shell (no Zeppelin support yet - ZEPPELIN-156)
• Comparable performance to Python and Scala
DataFrames
Amazon EMR runs Spark on YARN
• Dynamically share and centrally configure
the same pool of cluster resources across
engines
• Schedulers for categorizing, isolating, and
prioritizing workloads
• Choose the number of executors to use, or
allow YARN to choose (dynamic allocation)
• Kerberos authentication
Storage
S3, HDFS
YARN
Cluster Resource Management
Batch
MapReduce
In Memory
Spark
Applications
Pig, Hive, Cascading, Spark Streaming, Spark SQL
Inside Spark Executor on YARN
Max Container size on node
Executor Memory Overhead - Off-heap memory (VM overheads, interned strings, etc.)
spark.yarn.executor.memoryOverhead = executorMemory * 0.10
Executor Container
Memory
Overhead
Config File: spark-defaults.conf
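To make the sizing concrete, here is a small sketch of how the YARN container request follows from the executor memory setting. It uses the 10% factor from the slide plus the 384 MB floor that Spark applies to this overhead by default; the function names and sample sizes are illustrative only.

```python
MIN_OVERHEAD_MB = 384  # Spark's default floor for the YARN executor overhead


def memory_overhead_mb(executor_memory_mb, factor=0.10):
    """Default overhead: a fraction of executor memory, with a floor."""
    return max(MIN_OVERHEAD_MB, int(executor_memory_mb * factor))


def container_request_mb(executor_memory_mb):
    """Memory YARN must grant: executor heap plus off-heap overhead."""
    return executor_memory_mb + memory_overhead_mb(executor_memory_mb)


def fits(executor_memory_mb, max_container_mb):
    """True if the executor's container fits in the node's max container."""
    return container_request_mb(executor_memory_mb) <= max_container_mb


for heap in (2048, 8192):
    print(heap, memory_overhead_mb(heap), container_request_mb(heap))
```

The practical consequence is that spark.executor.memory has to be set below the maximum container size on the node, leaving room for the overhead.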
Inside Spark Executor on YARN
Max Container size on node
Spark executor memory - Amount of memory to use per executor process
spark.executor.memory
Executor Container
Memory
Overhead
Spark Executor Memory
Config File: spark-defaults.conf
Inside Spark Executor on YARN
Max Container size on node
Shuffle Memory Fraction – pre-Spark 1.6
Executor Container
Memory
Overhead
Spark Executor Memory
spark.shuffle.memoryFraction (default: 0.2)
Inside Spark Executor on YARN
Max Container size on node
Storage Memory Fraction – pre-Spark 1.6
Executor Container
Memory
Overhead
Spark Executor Memory
spark.shuffle.memoryFraction (default: 0.2)
spark.storage.memoryFraction (default: 0.6)
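As a rough illustration of the pre-Spark 1.6 static split, the sketch below divides an executor heap using the two default fractions shown on these slides (0.2 for shuffle, 0.6 for storage). It is a simplification: it ignores the additional safety fractions Spark applied on top of these values, and the function name is hypothetical.

```python
def legacy_memory_regions_mb(executor_memory_mb,
                             shuffle_fraction=0.2,
                             storage_fraction=0.6):
    """Approximate static heap regions under the pre-1.6 memory model."""
    shuffle = round(executor_memory_mb * shuffle_fraction)
    storage = round(executor_memory_mb * storage_fraction)
    # Whatever is left is available to user code and internal metadata.
    other = executor_memory_mb - shuffle - storage
    return {"shuffle": shuffle, "storage": storage, "other": other}


regions = legacy_memory_regions_mb(10240)
print(regions)
```

Because the regions are fixed, a job that caches nothing still cannot use the storage share for shuffles, which is the limitation the unified model in Spark 1.6+ addresses.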
Inside Spark Executor on YARN
Max Container size on node
In Spark 1.6+, Spark automatically balances the amount of memory for execution
and cached data.
Executor Container
Memory
Overhead
Spark Executor Memory
Execution / Cache
Default: 0.6
Dynamic Allocation on YARN
Scaling up on executors
- Request when you want the job to complete faster
- Idle resources on cluster
- Exponential increase in executors over time
New Default beginning EMR 4.4
Dynamic allocation setup
Property                                                  Value
spark.dynamicAllocation.enabled                           true
spark.shuffle.service.enabled                             true
Optional:
spark.dynamicAllocation.minExecutors                      5
spark.dynamicAllocation.maxExecutors                      17
spark.dynamicAllocation.initialExecutors                  0
spark.dynamicAllocation.executorIdleTimeout               60s
spark.dynamicAllocation.schedulerBacklogTimeout           5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout  5s
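The "exponential increase in executors" can be sketched with a small simulation. It is a simplified model, assuming tasks stay backlogged the whole time and the application starts at its minimum executor count: the first request fires after the scheduler backlog timeout, later requests fire every sustained backlog timeout, and each round asks for twice as many new executors as the previous one (1, 2, 4, 8, ...) up to the maximum. The values used are the ones from the table on this slide.

```python
def executor_ramp(min_executors, max_executors,
                  backlog_timeout_s, sustained_timeout_s):
    """Return [(seconds_since_backlog_began, target_executors), ...]."""
    total = min_executors
    to_add = 1                      # first round requests one executor
    elapsed = backlog_timeout_s     # first request after the backlog timeout
    schedule = []
    while total < max_executors:
        total = min(max_executors, total + to_add)
        schedule.append((elapsed, total))
        to_add *= 2                 # each round doubles the request size
        elapsed += sustained_timeout_s
    return schedule


ramp = executor_ramp(min_executors=5, max_executors=17,
                     backlog_timeout_s=5, sustained_timeout_s=5)
print(ramp)  # [(5, 6), (10, 8), (15, 12), (20, 17)]
```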
Compress your input data set
• Always compress data files on Amazon S3
• Reduces storage cost
• Reduces bandwidth between Amazon S3 and Amazon EMR, which can speed up bandwidth-constrained jobs
Compression
Compression Types:
– Some are fast BUT offer less space reduction
– Some are space efficient BUT slower
– Some are splittable and some are not
Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
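The "% space remaining" column can be measured for your own data. The sketch below does this for GZIP using only the Python standard library; LZO and Snappy need third-party libraries and are left out. The sample log line is made up and highly repetitive, so its ratio will be far better than the table's figure for real data.

```python
import gzip


def space_remaining(raw: bytes, level: int = 6) -> float:
    """Compressed size as a fraction of the original size."""
    compressed = gzip.compress(raw, compresslevel=level)
    # Sanity check: the data must round-trip unchanged.
    assert gzip.decompress(compressed) == raw
    return len(compressed) / len(raw)


# Hypothetical sample: one log line repeated many times.
sample = b"2016-04-01 GET /index.html 200 Mozilla/5.0\n" * 10_000
ratio = space_remaining(sample)
print(f"GZIP space remaining: {ratio:.1%}")
```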
Data Serialization
• Data is serialized when cached or shuffled
• Default: Java serializer
• Kryo serialization (up to 10x faster than Java serialization)
  • Does not support all Serializable types
  • Register the class in advance
Usage: Set in SparkConf
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
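The same setting can also be supplied at submit time instead of in code. As a small sketch, the helper below turns a dictionary of Spark properties into `--conf` arguments for spark-submit. The helper name, the registered class `com.example.Event`, and `app.jar` are hypothetical; `spark.kryo.classesToRegister` is the Spark property for registering classes in advance.

```python
def to_submit_args(conf):
    """Render Spark properties as spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


kryo_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Hypothetical application class to register with Kryo in advance.
    "spark.kryo.classesToRegister": "com.example.Event",
}

command = ["spark-submit", *to_submit_args(kryo_conf), "app.jar"]
print(" ".join(command))
```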
Running Spark on
Amazon EMR
Focus on deriving insights from your data
instead of manually configuring clusters
Easy to install and
configure Spark
Secured
Spark submit, Oozie or
use Zeppelin UI
Quickly add
and remove capacity
Hourly, reserved, or
EC2 Spot pricing
Use S3 to decouple
compute and storage
Launch the latest Spark version
Spark 1.6.1 is the current version on EMR.
< 3 week cadence with latest open source release
Create a fully configured cluster in minutes
AWS Management
Console
AWS Command Line
Interface (CLI)
Or use an AWS SDK directly with the Amazon EMR API
Or easily change your settings
Many storage layers to choose from
• Amazon DynamoDB: EMR-DynamoDB connector
• Amazon RDS: JDBC Data Source w/ Spark SQL
• Amazon Kinesis: Streaming data connectors
• Elasticsearch: Elasticsearch connector
• Amazon Redshift: Spark-Redshift connector
• Amazon S3: EMR File System (EMRFS)
Amazon EMR
Decouple compute and storage by using S3
as your data layer
HDFS
S3 is designed for 11
9’s of durability and is
massively scalable
EC2 Instance
Memory
Amazon S3
Amazon EMR
Easy to run your Spark workloads
Amazon EMR Step API
SSH to master node and use Spark
Submit, Oozie or Zeppelin
Submit a Spark
application
Amazon EMR
Customer use cases
Some of our customers running Spark on EMR
Integration Pattern – ETL with Spark
[Diagram: Amazon EMR, Amazon S3, HDFS]
Extract: read unstructured data
Write structured data
Load from HDFS
Store output data
Integration Pattern – Tumbling Window Reporting
Amazon EMR
Amazon
Kinesis
Streaming Input
HDFS
Tumbling/Fixed
Window
Aggregation
Periodic Output
Amazon Redshift
COPY from EMR
Or checkpoint to S3 and use
the Lambda loader app
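The core of tumbling (fixed) window aggregation is independent of the streaming engine. Below is a plain-Python sketch of the idea, not Spark Streaming code: each event is assigned to exactly one non-overlapping window by rounding its timestamp down to the window size, and a count is kept per window and key. The event data and the 60-second window are made up for illustration.

```python
from collections import defaultdict


def tumbling_counts(events, window_s):
    """Count events per (window_start, key) over fixed, non-overlapping windows.

    events: iterable of (epoch_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_s)  # round down to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical events: (timestamp in seconds, page key).
events = [(0, "a"), (30, "a"), (59, "b"), (60, "a"), (125, "b")]
result = tumbling_counts(events, window_s=60)
print(result)
```

Each completed window's counts correspond to one periodic output batch, which is what would then be loaded into Amazon Redshift.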
EMR Security Overview
Security Fundamentals
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Query string authentication
• SSL endpoints
Encryption
• Server-Side Encryption (SSE-S3)
• Server-Side Encryption with KMS-provided keys (coming soon)
• Client-side Encryption
Compliance
• Bucket access logs
• Lifecycle Management Policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
Networking: VPC private subnets
• Use Amazon S3 Endpoints for
connectivity to S3
• Use Managed NAT for connectivity to
other services or the Internet
• Control the traffic using Security Groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess
Access Control: IAM Users and Roles
• IAM Policies for access to Amazon EMR service (IAM users or
federated users)
• AmazonElasticMapReduceFullAccess
• AmazonElasticMapReduceReadOnlyAccess
• IAM Policies for Amazon EMR cluster
• Service role (AmazonElasticMapReduceRole) - Allowable
actions for Amazon EMR service, like creating EC2 instances.
• Instance profile (AmazonElasticMapReduceforEC2Role) -
Applications that run on Amazon EMR, like access to Amazon S3
for EMRFS on your cluster.
Data at Rest: S3 client-side encryption
Amazon S3
Amazon S3 encryption clients
EMRFS enabled for Amazon S3 client-side encryption
Key vendor (AWS KMS or your custom key vendor)
(client-side encrypted objects)
Customer Stories
AOL’s Spot Use Case: restate 6 months of
historical data
10 Availability Zones
550 EMR clusters
24,000 Spot EC2 instances
[Chart: Timing Comparison, In-House vs. AWS]
OUR CLOUD ARCHITECTURE
FINRA saves money with comparable
performance with Hive on Tez with S3
Using EMR and cloud capacity for ETL
Bridging on-prem and EMR for easy ETL
Twitter (Answers) uses EMR as the batch layer
in their Lambda architecture
Using EMR for batch, streaming, and ad hoc
SmartNews
Nasdaq: data lake architecture diagram
Optimizing data warehousing costs with S3 and EMR
AWS Pop-up Loft London
Thank You
Abhishek Sinha | sinhaar@amazon.com | @abysinha

More Related Content

What's hot

(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Amazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
Amazon RDS: Deep Dive - SRV310 - Chicago AWS SummitAmazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
Amazon RDS: Deep Dive - SRV310 - Chicago AWS SummitAmazon Web Services
 
Introduction to Amazon Elastic File System (EFS)
Introduction to Amazon Elastic File System (EFS)Introduction to Amazon Elastic File System (EFS)
Introduction to Amazon Elastic File System (EFS)Amazon Web Services
 
AWS Webcast - Introduction to EBS
AWS Webcast - Introduction to EBS AWS Webcast - Introduction to EBS
AWS Webcast - Introduction to EBS Amazon Web Services
 
Intro to Amazon S3
Intro to Amazon S3Intro to Amazon S3
Intro to Amazon S3Yu Lun Teo
 
AWS Monitoring & Logging
AWS Monitoring & LoggingAWS Monitoring & Logging
AWS Monitoring & LoggingJason Poley
 
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Aws glue를 통한 손쉬운 데이터 전처리 작업하기Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Aws glue를 통한 손쉬운 데이터 전처리 작업하기Amazon Web Services Korea
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Amazon Elastic Compute Cloud (EC2) - Module 2 Part 1 - AWSome Day 2017
Amazon Elastic Compute Cloud (EC2) - Module 2 Part 1 - AWSome Day 2017Amazon Elastic Compute Cloud (EC2) - Module 2 Part 1 - AWSome Day 2017
Amazon Elastic Compute Cloud (EC2) - Module 2 Part 1 - AWSome Day 2017Amazon Web Services
 

What's hot (20)

(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
AWS EC2
AWS EC2AWS EC2
AWS EC2
 
Amazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
Amazon RDS: Deep Dive - SRV310 - Chicago AWS SummitAmazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
Amazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
 
AWS Storage Options
AWS Storage OptionsAWS Storage Options
AWS Storage Options
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Introduction to Amazon Elastic File System (EFS)
Introduction to Amazon Elastic File System (EFS)Introduction to Amazon Elastic File System (EFS)
Introduction to Amazon Elastic File System (EFS)
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
 
Getting Started with Amazon EC2
Getting Started with Amazon EC2Getting Started with Amazon EC2
Getting Started with Amazon EC2
 
AWS Webcast - Introduction to EBS
AWS Webcast - Introduction to EBS AWS Webcast - Introduction to EBS
AWS Webcast - Introduction to EBS
 
Intro to Amazon S3
Intro to Amazon S3Intro to Amazon S3
Intro to Amazon S3
 
AWS Monitoring & Logging
AWS Monitoring & LoggingAWS Monitoring & Logging
AWS Monitoring & Logging
 
Amazon Aurora: Under the Hood
Amazon Aurora: Under the HoodAmazon Aurora: Under the Hood
Amazon Aurora: Under the Hood
 
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Aws glue를 통한 손쉬운 데이터 전처리 작업하기Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
 
Machine Learning on AWS
Machine Learning on AWSMachine Learning on AWS
Machine Learning on AWS
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Introduction to Amazon Aurora
Introduction to Amazon AuroraIntroduction to Amazon Aurora
Introduction to Amazon Aurora
 
Amazon Elastic Compute Cloud (EC2) - Module 2 Part 1 - AWSome Day 2017
Amazon Elastic Compute Cloud (EC2) - Module 2 Part 1 - AWSome Day 2017Amazon Elastic Compute Cloud (EC2) - Module 2 Part 1 - AWSome Day 2017
Amazon Elastic Compute Cloud (EC2) - Module 2 Part 1 - AWSome Day 2017
 
Introduction to Amazon EC2
Introduction to Amazon EC2Introduction to Amazon EC2
Introduction to Amazon EC2
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
 

Viewers also liked

Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
Big Data & Analytics: End to End on AWS - Technical 101
Big Data & Analytics: End to End on AWS - Technical 101Big Data & Analytics: End to End on AWS - Technical 101
Big Data & Analytics: End to End on AWS - Technical 101Amazon Web Services
 
Hadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRrICh morrow
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Amazon Web Services
 
Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
 Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
Big Data Analytics using Amazon Elastic MapReduce and Amazon RedshiftIndicThreads
 
Recommendations with hadoop streaming and python
Recommendations with hadoop streaming and pythonRecommendations with hadoop streaming and python
Recommendations with hadoop streaming and pythonAndrew Look
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRIsrael AWS User Group
 
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015Amazon Web Services Korea
 
Announcing AWS Batch - Run Batch Jobs At Scale - December 2016 Monthly Webina...
Announcing AWS Batch - Run Batch Jobs At Scale - December 2016 Monthly Webina...Announcing AWS Batch - Run Batch Jobs At Scale - December 2016 Monthly Webina...
Announcing AWS Batch - Run Batch Jobs At Scale - December 2016 Monthly Webina...Amazon Web Services
 
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...Amazon Web Services
 
AWS re:Invent 2016: [JK REPEAT] Serverless Architectural Patterns and Best Pr...
AWS re:Invent 2016: [JK REPEAT] Serverless Architectural Patterns and Best Pr...AWS re:Invent 2016: [JK REPEAT] Serverless Architectural Patterns and Best Pr...
AWS re:Invent 2016: [JK REPEAT] Serverless Architectural Patterns and Best Pr...Amazon Web Services
 
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...Amazon Web Services
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationmattlieber
 
Hands-on Labs: Getting Started with AWS - March 2017 AWS Online Tech Talks
Hands-on Labs: Getting Started with AWS  - March 2017 AWS Online Tech TalksHands-on Labs: Getting Started with AWS  - March 2017 AWS Online Tech Talks
Hands-on Labs: Getting Started with AWS - March 2017 AWS Online Tech TalksAmazon Web Services
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewDoiT International
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesDoiT International
 
AWS re:Invent 2016| HLC301 | Data Science and Healthcare: Running Large Scale...
AWS re:Invent 2016| HLC301 | Data Science and Healthcare: Running Large Scale...AWS re:Invent 2016| HLC301 | Data Science and Healthcare: Running Large Scale...
AWS re:Invent 2016| HLC301 | Data Science and Healthcare: Running Large Scale...Amazon Web Services
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Amazon Web Services
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 

Viewers also liked (20)

Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Big Data & Analytics: End to End on AWS - Technical 101
Big Data & Analytics: End to End on AWS - Technical 101Big Data & Analytics: End to End on AWS - Technical 101
Big Data & Analytics: End to End on AWS - Technical 101
 
Hadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMR
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301
 
Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
 Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
 
Recommendations with hadoop streaming and python
Recommendations with hadoop streaming and pythonRecommendations with hadoop streaming and python
Recommendations with hadoop streaming and python
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
 
Announcing AWS Batch - Run Batch Jobs At Scale - December 2016 Monthly Webina...
Announcing AWS Batch - Run Batch Jobs At Scale - December 2016 Monthly Webina...Announcing AWS Batch - Run Batch Jobs At Scale - December 2016 Monthly Webina...
Announcing AWS Batch - Run Batch Jobs At Scale - December 2016 Monthly Webina...
 
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...
 
AWS re:Invent 2016: [JK REPEAT] Serverless Architectural Patterns and Best Pr...
AWS re:Invent 2016: [JK REPEAT] Serverless Architectural Patterns and Best Pr...AWS re:Invent 2016: [JK REPEAT] Serverless Architectural Patterns and Best Pr...
AWS re:Invent 2016: [JK REPEAT] Serverless Architectural Patterns and Best Pr...
 
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
 
Hands-on Labs: Getting Started with AWS - March 2017 AWS Online Tech Talks
Hands-on Labs: Getting Started with AWS  - March 2017 AWS Online Tech TalksHands-on Labs: Getting Started with AWS  - March 2017 AWS Online Tech Talks
Hands-on Labs: Getting Started with AWS - March 2017 AWS Online Tech Talks
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s New
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL Queries
 
AWS re:Invent 2016| HLC301 | Data Science and Healthcare: Running Large Scale...
AWS re:Invent 2016| HLC301 | Data Science and Healthcare: Running Large Scale...AWS re:Invent 2016| HLC301 | Data Science and Healthcare: Running Large Scale...
AWS re:Invent 2016| HLC301 | Data Science and Healthcare: Running Large Scale...
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 

Similar to Sr. Product Manager at Amazon leads EMR

Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAmazon Web Services
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Sr. Product Manager at Amazon leads EMR

  • 1. Abhishek Sinha – Sr. Product Manager sinhaar@amazon.com @abysinha Amazon EMR
  • 2. Amazon EMR Making it easy, secure and cost-effective to run data processing frameworks on the AWS cloud
  • 3. Amazon EMR • Managed platform • Hadoop MapReduce, Spark, Presto, and more • Launch clusters in minutes • Apache Bigtop based distribution • Leverage the elasticity of the cloud • Added security features • Pay by the hour and save with Spot • Flexibility to customize • Programmable Infrastructure
  • 4. What do I need to build a cluster? 1. Choose instances 2. Choose your software 3. Choose your access method
  • 5. Cluster composition • Master Node: NameNode (HDFS), ResourceManager (YARN), and other components • Core Instance Group: HDFS DataNode, YARN NodeManager • Task Instance Groups: YARN NodeManager
  • 6. Choice of multiple instances • CPU: c3, c4 families • Memory: m2, r3 families • Disk/IO: d2, i2 families • General: m1, m3, m4 families • Typical workloads: machine learning, batch processing, in-memory (Spark & Presto), large HDFS • Or add EBS volumes if you need additional on-cluster storage.
  • 7. Hadoop applications available in EMR Or, use bootstrap actions to install arbitrary applications on your cluster!
  • 8. Choose your software - Quick Create
  • 9. Choose your software – Advanced Options
  • 10. Configuration API for custom configs [ { "Classification": "core-site", "Properties": { "hadoop.security.groups.cache.secs": "250" } }, { "Classification": "mapred-site", "Properties": { "mapred.tasktracker.map.tasks.maximum": "2", "mapreduce.map.sort.spill.percent": "90", "mapreduce.tasktracker.reduce.tasks.maximum": "5" } } ]
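The configuration list on this slide is plain JSON, so it can be assembled and checked in code before it is passed to cluster creation. A minimal Python sketch, using only the classifications and properties shown on the slide; the helper name `classification` is hypothetical and is not part of any AWS SDK:

```python
import json


def classification(name, properties):
    """Build one entry of the configuration list (hypothetical helper)."""
    # Property values are written as strings on the slide, so coerce them here.
    return {
        "Classification": name,
        "Properties": {key: str(value) for key, value in properties.items()},
    }


configurations = [
    classification("core-site", {"hadoop.security.groups.cache.secs": 250}),
    classification("mapred-site", {
        "mapred.tasktracker.map.tasks.maximum": 2,
        "mapreduce.map.sort.spill.percent": 90,
        "mapreduce.tasktracker.reduce.tasks.maximum": 5,
    }),
]

# Serialize to the JSON document that the slide shows.
configurations_json = json.dumps(configurations, indent=2)
print(configurations_json)
```

Building the list from a helper keeps every value a string and makes it easy to add further classifications without hand-editing JSON.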
  • 11. Use the AWS CLI to easily create clusters: aws emr create-cluster --release-label emr-4.3.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge Or use your favorite SDK for programmatic provisioning:
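The CLI call on this slide can also be generated from structured data, which helps when cluster shapes vary between runs. The sketch below only builds the command line and does not execute it. It uses just the two options that appear on the slide (`--release-label` and `--instance-groups`); a real invocation will usually need further options that the slide does not show. The helper `instance_group` is a hypothetical name.

```python
import shlex


def instance_group(group_type, count, instance_type):
    """Render one --instance-groups entry in key=value shorthand (hypothetical helper)."""
    return (f"InstanceGroupType={group_type},"
            f"InstanceCount={count},InstanceType={instance_type}")


command = [
    "aws", "emr", "create-cluster",
    "--release-label", "emr-4.3.0",
    "--instance-groups",
    instance_group("MASTER", 1, "m3.xlarge"),
    instance_group("CORE", 2, "m3.xlarge"),
]

# shlex.join quotes arguments where needed, so the result is safe to paste into a shell.
command_line = shlex.join(command)
print(command_line)
```

Keeping the arguments as a list also lets the same data be passed straight to `subprocess.run(command)` if the AWS CLI is installed.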
  • 12. Use Amazon EMR to separate your compute and storage.
  • 13. On premises: compute and storage grow together Tightly coupled Storage grows along with compute Compute requirements vary
  • 14. On premises: Underutilized or scarce resources [Chart: cluster utilization over time, showing re-processing, weekly peaks, and steady state]
  • 15. On premises: Contention for same resources Compute bound Memory bound
  • 16. Separation of resources creates data silos Team A
  • 17. On premises: Replication adds to cost 3x HDFS needs 3x Multi-Data Center DR
  • 18. Use Amazon EMR to separate your compute and storage.
  • 19. EMR can process data from many sources • Hadoop Distributed File System (HDFS) • Amazon S3 (EMRFS) • Amazon DynamoDB, Redshift, Aurora, RDS • Amazon Kinesis • Other applications running in your architecture (Kafka, Elasticsearch, etc.)
  • 20. Amazon S3 is your persistent data store 11 9’s of durability $0.03 / GB / Month in US-East Life Cycle Policies Available across AZs Easy access Amazon S3
  • 21. The EMR Filesystem (EMRFS) • Allows you to leverage S3 as a file-system for Hadoop • Streams data directly from S3 • Cluster still uses local disk/HDFS for intermediates • Better read/write performance and error handling than open source components • Optional consistent view for consistent list • Support for encryption • Fast listing of objects
  • 22. Going from HDFS to S3 CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' ) LOCATION 'samples/pig-apache/input/'
  • 23. Going from HDFS to S3 CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' ) LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
  • 24. Benefit 1: Switch off clusters [Diagram: data persists in Amazon S3 while clusters are shut down]
  • 26. You can build a pipeline Submit jobs using: - EMR Step API - Oozie - SSH directly - Genie (Gateway) - OSS workflow tools (e.g., Luigi)
  • 27. You can use AWS Data Pipeline Input data Use EMR to transform unstructured to structured data Push to S3 Ingest into Redshift
  • 28. Run transient or long-running clusters
  • 29. Benefit 2: Resize your cluster to match workload requirements
  • 30. Resize using the Console, CLI, or API
  • 31. Save costs with EC2 Spot instances Bid Price OD Price
  • 32. Spot integration aws emr create-cluster --name "Spot cluster" --ami-version 3.3 --instance-groups InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
  • 33. The Spot Bid Advisor
  • 34. Spot Integration with EMR • Can provision instances from the Spot Market • Replaces a spot instance in case of interruption • Impact of interruption • Master Node – Can lose the cluster • Core Node – Can lose data stored in HDFS • Task Nodes – lose the task (but the task will run elsewhere)
  • 35. Scale up with Spot Instances 10 node cluster running for 14 hours Cost = 1.0 * 10 * 14 = $140
  • 36. Resize Nodes with Spot Instances Add 10 more nodes on Spot
  • 37. Resize Nodes with Spot Instances 20 node cluster running for 7 hours On-Demand cost = 1.0 * 10 * 7 = $70; Spot cost = 0.5 * 10 * 7 = $35; Total = $105
  • 38. Resize Nodes with Spot Instances 50% less run-time (14 → 7 hours) 25% less cost ($140 → $105)
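The arithmetic on slides 35 to 38 can be reproduced in a few lines. The sketch below uses the illustrative rates from the slides ($1.00 per node-hour On-Demand, $0.50 per node-hour on Spot) and keeps the slides' assumption that doubling the node count halves the run-time. The function and constant names are hypothetical.

```python
ON_DEMAND_RATE = 1.0  # $ per node-hour, illustrative rate from the slides
SPOT_RATE = 0.5       # $ per node-hour, illustrative Spot rate from the slides


def cluster_cost(nodes, hours, rate):
    """Cost of running `nodes` instances for `hours` at a flat hourly rate."""
    return nodes * hours * rate


# Baseline: 10 On-Demand nodes for 14 hours.
baseline_hours = 14
baseline_cost = cluster_cost(10, baseline_hours, ON_DEMAND_RATE)

# Resized: 10 On-Demand nodes plus 10 Spot nodes for 7 hours.
resized_hours = 7
resized_cost = (cluster_cost(10, resized_hours, ON_DEMAND_RATE)
                + cluster_cost(10, resized_hours, SPOT_RATE))

runtime_saving = 1 - resized_hours / baseline_hours
cost_saving = 1 - resized_cost / baseline_cost

print(f"baseline: ${baseline_cost:.0f}, resized: ${resized_cost:.0f}")
print(f"run-time saving: {runtime_saving:.0%}, cost saving: {cost_saving:.0%}")
```

Real Spot prices fluctuate and jobs rarely scale perfectly linearly, so the 25% figure is best read as the outcome under these specific assumptions.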
  • 41. Effectively utilize clusters [Chart: cluster utilization over time]
  • 42. Benefit 3: Logical separation of jobs Hive, Pig, Cascading Prod Presto Ad-Hoc Amazon S3
  • 43. Benefit 4 : Disaster recovery built-in Cluster 1 Cluster 2 Cluster 3 Cluster 4 Amazon S3 Availability Zone Availability Zone Hive Metastore in Amazon RDS
  • 44. S3 as a data lake Nate Summons, Principal Architect - NASDAQ
  • 46. EMR logging to S3 makes logs easily available
  • 47.
  • 48. Spark moves at interactive speed • Massively parallel • Uses DAGs instead of MapReduce for execution • Minimizes I/O by storing data in RDDs in memory • Partitioning-aware to avoid network-intensive shuffle [Figure: DAG of RDDs A to F connected by map, join, filter, and groupBy operations across Stages 1 to 3, with cached partitions marked]
  • 49. Spark components to match your use case
  • 50. Spark speaks your language
  • 51. Use DataFrames to easily interact with data • Distributed collection of data organized in columns • An extension of the existing RDD API • Optimized for query execution
  • 52. Easily create DataFrames from many formats RDD Additional libraries for Spark SQL Data Sources at spark-packages.org
  • 53. Load data with the Spark SQL Data Sources API Additional libraries at spark-packages.org
  • 55. Use DataFrames for machine learning • Spark ML libraries (replacing MLlib) use DataFrames as input/output for models • Create ML pipelines with a variety of distributed algorithms
  • 56. Create DataFrames on streaming data • Access data in Spark Streaming DStream • Create SQLContext on the SparkContext used for Spark Streaming application for ad hoc queries • Incorporate DataFrame in Spark Streaming application • Checkpointing streaming jobs
  • 58. Use R to interact with DataFrames • SparkR package for using R to manipulate DataFrames • Create SparkR applications or interactively use the SparkR shell (no Zeppelin support yet - ZEPPELIN-156) • Comparable performance to Python and Scala DataFrames
  • 59.
  • 60. Amazon EMR runs Spark on YARN • Dynamically share and centrally configure the same pool of cluster resources across engines • Schedulers for categorizing, isolating, and prioritizing workloads • Choose the number of executors to use, or allow YARN to choose (dynamic allocation) • Kerberos authentication [Stack diagram: Storage (S3, HDFS); Cluster Resource Management (YARN); Batch (MapReduce) and In Memory (Spark); Applications (Pig, Hive, Cascading, Spark Streaming, Spark SQL)]
  • 61. Inside Spark Executor on YARN Max Container size on node Executor Memory Overhead - Off heap memory (VM overheads, interned strings etc.) spark.yarn.executor.memoryOverhead = executorMemory * 0.10 Executor Container Memory Overhead Config File: spark-defaults.conf
  • 62. Inside Spark Executor on YARN Max Container size on node Spark executor memory - Amount of memory to use per executor process spark.executor.memory Executor Container Memory Overhead Spark Executor Memory Config File: spark-defaults.conf
  • 63. Inside Spark Executor on YARN • Shuffle memory fraction (pre-Spark 1.6): spark.shuffle.memoryFraction, default 0.2 • Diagram: max container size on node > executor container > memory overhead + Spark executor memory > shuffle memoryFraction
  • 64. Inside Spark Executor on YARN • Storage memory fraction (pre-Spark 1.6): spark.storage.memoryFraction, default 0.6 • Diagram: max container size on node > executor container > memory overhead + Spark executor memory > shuffle memoryFraction + storage memoryFraction
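
As a rough worked example of the pre-1.6 static split, the sketch below divides an executor heap using the two default fractions from the slides (0.2 shuffle, 0.6 storage). It is a simplification: the function name is made up for illustration, and the additional safety fractions Spark applied to each region are deliberately ignored.

```python
def legacy_memory_split_mb(executor_memory_mb: int,
                           shuffle_fraction: float = 0.2,
                           storage_fraction: float = 0.6) -> dict:
    """Split executor memory into shuffle, storage, and remaining regions (MB)."""
    if shuffle_fraction < 0 or storage_fraction < 0:
        raise ValueError("fractions must be non-negative")
    if shuffle_fraction + storage_fraction > 1:
        raise ValueError("fractions must not sum to more than 1")
    shuffle_mb = round(executor_memory_mb * shuffle_fraction)
    storage_mb = round(executor_memory_mb * storage_fraction)
    return {
        "shuffle_mb": shuffle_mb,
        "storage_mb": storage_mb,
        # Whatever is left is available for user code and other objects.
        "other_mb": executor_memory_mb - shuffle_mb - storage_mb,
    }


if __name__ == "__main__":
    print(legacy_memory_split_mb(10000))
```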
  • 65. Inside Spark Executor on YARN Max Container size on node In Spark 1.6+, Spark automatically balances the amount of memory for execution and cached data. Executor Container Memory Overhead Spark Executor Memory Execution / Cache Default: 0.6
  • 66. Dynamic Allocation on YARN • Scaling up on executors: request more executors when you want the job to complete faster; uses idle resources on the cluster; exponential increase in executors over time • New default beginning with EMR 4.4
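
The "exponential increase in executors over time" can be pictured with a small simulation. This is a sketch of the general idea only, assuming the number of executors requested doubles in each successive round (1, 2, 4, 8, ...) until a cap is hit. The function name and the round-based model are illustrative and do not reproduce Spark's scheduler.

```python
def executor_ramp(start: int, max_executors: int, rounds: int) -> list:
    """Return the executor total after each request round, doubling the increment."""
    totals = []
    total, step = start, 1
    for _ in range(rounds):
        total = min(max_executors, total + step)
        totals.append(total)
        if total == max_executors:
            break          # cap reached, nothing more to request
        step *= 2          # next round asks for twice as many executors
    return totals


if __name__ == "__main__":
    # Starting from 5 executors with a cap of 17.
    print(executor_ramp(5, 17, 10))
```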
  • 67. Dynamic allocation setup (property: value; entries after the first two are optional) • spark.dynamicAllocation.enabled: true • spark.shuffle.service.enabled: true • spark.dynamicAllocation.minExecutors: 5 • spark.dynamicAllocation.maxExecutors: 17 • spark.dynamicAllocation.initialExecutors: 0 • spark.dynamicAllocation.executorIdleTimeout: 60s • spark.dynamicAllocation.schedulerBacklogTimeout: 5s • spark.dynamicAllocation.sustainedSchedulerBacklogTimeout: 5s
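
One way to apply these settings at cluster launch is through the configuration API shown earlier in the deck. The sketch below builds that JSON in Python, using the standard Spark spellings of the property names and the values from the slide. It assumes the `spark-defaults` classification is the right target, and the helper name is hypothetical.

```python
import json

# Values taken from the slide; keys use the standard Spark property names.
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.dynamicAllocation.maxExecutors": "17",
    "spark.dynamicAllocation.initialExecutors": "0",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "5s",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "5s",
}


def spark_defaults_classification(properties: dict) -> list:
    """Wrap Spark properties in the Classification/Properties list format."""
    return [{"Classification": "spark-defaults", "Properties": dict(properties)}]


if __name__ == "__main__":
    print(json.dumps(spark_defaults_classification(DYNAMIC_ALLOCATION), indent=2))
```

The printed JSON can be saved to a file and supplied when creating the cluster.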
  • 68. Compress your input data set • Always compress data files on Amazon S3 • Reduces storage cost • Reduces bandwidth used between Amazon S3 and Amazon EMR, which can speed up bandwidth-constrained jobs
  • 69. Compression • Compression types: some are fast but offer less space reduction; some are space-efficient but slower; some are splittable and some are not • Algorithm (% space remaining, encoding speed, decoding speed) • GZIP: 13%, 21 MB/s, 118 MB/s • LZO: 20%, 135 MB/s, 410 MB/s • Snappy: 22%, 172 MB/s, 409 MB/s
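
To see what "% space remaining" and encoding speed mean on your own data, a quick measurement with Python's standard-library `gzip` module may help. This sketch covers GZIP only, since LZO and Snappy need third-party libraries. The function name and sample data are illustrative, and the numbers it prints will not match the table above.

```python
import gzip
import time


def gzip_profile(raw: bytes, level: int = 6) -> dict:
    """Measure space remaining (%) and encoding throughput (MB/s) for gzip."""
    if not raw:
        raise ValueError("need at least one byte of input")
    started = time.perf_counter()
    compressed = gzip.compress(raw, compresslevel=level)
    elapsed = max(time.perf_counter() - started, 1e-9)  # avoid division by zero
    return {
        "space_remaining_pct": 100.0 * len(compressed) / len(raw),
        "encode_mb_per_s": len(raw) / 1e6 / elapsed,
    }


if __name__ == "__main__":
    # Hypothetical log-like sample; real data sets compress differently.
    sample = b"2016-04-01,GET,/index.html,200\n" * 100_000
    print(gzip_profile(sample))
```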
  • 70. Data Serialization • Data is serialized when cached or shuffled • Default: Java serializer • Kryo serialization (up to 10x faster than Java serialization) • Does not support all Serializable types • Register the class in advance • Usage: set in SparkConf: conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
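
The same serializer setting can also be passed at submit time instead of in code. The sketch below assembles `--conf` arguments for `spark-submit`. The helper names and the example class are hypothetical, and `spark.kryo.classesToRegister` and `spark.kryo.registrationRequired` are offered here as one way to "register the class in advance", not as the presenter's recommended setup.

```python
def kryo_conf(classes_to_register=(), registration_required=False) -> dict:
    """Build Spark properties that switch serialization to Kryo."""
    conf = {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}
    if classes_to_register:
        # Comma-separated list of fully qualified class names.
        conf["spark.kryo.classesToRegister"] = ",".join(classes_to_register)
    if registration_required:
        # Fail fast when an unregistered class is serialized.
        conf["spark.kryo.registrationRequired"] = "true"
    return conf


def to_submit_args(conf: dict) -> list:
    """Turn a property dict into spark-submit '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # com.example.Click is a placeholder class name.
    print(" ".join(to_submit_args(kryo_conf(["com.example.Click"]))))
```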
  • 72. Focus on deriving insights from your data instead of manually configuring clusters • Easy to install and configure Spark • Secured • Spark submit, Oozie or use Zeppelin UI • Quickly add and remove capacity • Hourly, reserved, or EC2 Spot pricing • Use S3 to decouple compute and storage
  • 73. Launch the latest Spark version Spark 1.6.1 is the current version on EMR. < 3 week cadence with latest open source release
  • 74. Create a fully configured cluster in minutes • AWS Management Console • AWS Command Line Interface (CLI) • Or use an AWS SDK directly with the Amazon EMR API
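
For the SDK route, a minimal Python sketch using boto3 might look like the following. It mirrors the CLI example earlier in the deck (emr-4.3.0, one master and two core m3.xlarge instances). The cluster name, region, and use of the default EMR roles are assumptions, and `launch` makes a real API call that needs AWS credentials, so only the request-building part runs without them.

```python
def build_cluster_request(name: str,
                          release_label: str = "emr-4.3.0",
                          instance_type: str = "m3.xlarge",
                          core_count: int = 2) -> dict:
    """Assemble keyword arguments for the EMR RunJobFlow API."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the account already has the default EMR roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def launch(request: dict, region: str = "us-east-1") -> str:
    """Create the cluster and return its ID (requires boto3 and credentials)."""
    import boto3  # imported lazily so the request builder works without it
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    print(build_cluster_request("spark-demo"))
```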
  • 75. Or easily change your settings
  • 76. Many storage layers to choose from • Amazon DynamoDB: EMR-DynamoDB connector • Amazon RDS: JDBC Data Source w/ Spark SQL • Amazon Kinesis: streaming data connectors • Elasticsearch: Elasticsearch connector • Amazon Redshift: Spark-Redshift connector • Amazon S3: EMR File System (EMRFS) • All accessed from Amazon EMR
  • 77. Decouple compute and storage by using S3 as your data layer • S3 is designed for 11 9’s of durability and is massively scalable • Diagram: multiple Amazon EMR clusters (EC2 instance memory, HDFS) sharing Amazon S3
  • 78. Easy to run your Spark workloads • Submit a Spark application with the Amazon EMR Step API • Or SSH to the master node and use spark-submit, Oozie, or Zeppelin
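
Submitting a Spark application through the Step API can be sketched with boto3 as below. The step runs `spark-submit` via `command-runner.jar`. The S3 path, step name, and cluster ID are placeholders, and `submit` performs a real API call that needs AWS credentials.

```python
def spark_step(name: str, app_s3_path: str, app_args=(),
               deploy_mode: str = "cluster") -> dict:
    """Describe an EMR step that runs spark-submit on the cluster."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", deploy_mode,
                     app_s3_path, *app_args],
        },
    }


def submit(cluster_id: str, step: dict, region: str = "us-east-1") -> str:
    """Add the step to a running cluster and return the step ID."""
    import boto3  # imported lazily so spark_step works without it
    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


if __name__ == "__main__":
    # s3://my-bucket/jobs/etl.py is a hypothetical application location.
    print(spark_step("nightly-etl", "s3://my-bucket/jobs/etl.py", ["2016-04-01"]))
```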
  • 80. Some of our customers running Spark on EMR
  • 82. Integration Pattern – ETL with Spark • Components: Amazon S3, Amazon EMR, HDFS • Extract: read unstructured data • Write structured data • Load from HDFS • Store output data
  • 83. Integration Pattern – Tumbling Window Reporting • Amazon Kinesis: streaming input • Amazon EMR (HDFS): tumbling/fixed window aggregation, periodic output • Amazon Redshift: COPY from EMR, or checkpoint to S3 and use the Lambda loader app
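
A tumbling (fixed) window assigns each event to exactly one non-overlapping time bucket. The plain-Python sketch below shows only that bucketing logic. It is not Spark Streaming code, and the event shape (epoch seconds plus a key) and function name are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds: int) -> dict:
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    events: iterable of (epoch_seconds, key) pairs.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    clicks = [(0, "a"), (59, "a"), (60, "a"), (61, "b"), (125, "a")]
    for (start, key), count in sorted(tumbling_window_counts(clicks, 60).items()):
        print(start, key, count)
```

Each window's totals are final once the window closes, which is what makes the periodic output to a reporting store straightforward.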
  • 85. Fundamentals: Security, Encryption, Compliance • Identity and Access Management (IAM) policies • Bucket policies • Access Control Lists (ACLs) • Query string authentication • SSL endpoints • Server Side Encryption (SSE-S3) • Server Side Encryption with KMS provided keys (coming soon) • Client-side Encryption • Bucket access logs • Lifecycle Management Policies • Access Control Lists (ACLs) • Versioning & MFA deletes
  • 86. Networking: VPC private subnets • Use Amazon S3 Endpoints for connectivity to S3 • Use Managed NAT for connectivity to other services or the Internet • Control the traffic using Security Groups • ElasticMapReduce-Master-Private • ElasticMapReduce-Slave-Private • ElasticMapReduce-ServiceAccess
  • 87. Access Control: IAM Users and Roles • IAM Policies for access to Amazon EMR service (IAM users or federated users) • AmazonElasticMapReduceFullAccess • AmazonElasticMapReduceReadOnlyAccess • IAM Policies for Amazon EMR cluster • Service role (AmazonElasticMapReduceRole) - Allowable actions for Amazon EMR service, like creating EC2 instances. • Instance profile (AmazonElasticMapReduceforEC2Role) - Applications that run on Amazon EMR, like access to Amazon S3 for EMRFS on your cluster.
  • 88. Data at Rest: S3 client-side encryption • Amazon S3 encryption clients • EMRFS enabled for Amazon S3 client-side encryption • Key vendor (AWS KMS or your custom key vendor) • Amazon S3 (client-side encrypted objects)
  • 90. AOL’s Spot Use Case: restate 6 months of historical data • 10 Availability Zones • 550 EMR clusters • 24,000 Spot EC2 instances • Chart: Timing Comparison, In-House vs. AWS
  • 92. FINRA saves money, with comparable performance, using Hive on Tez with S3
  • 93. Using EMR and cloud capacity for ETL
  • 94. Bridging on-prem and EMR for easy ETL
  • 96. Twitter (Answers) uses EMR as the batch layer in their Lambda architecture
  • 97. SmartNews: using EMR for batch, streaming, and ad hoc
  • 98. Nasdaq: data lake architecture diagram Optimizing data warehousing costs with S3 and EMR
  • 99. AWS Pop-up Loft London Thank You Abhishek Sinha | sinhaar@amazon.com | @abysinha