SlideShare a Scribd company logo
1 of 58
Download to read offline
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Sr. Product Manager, Amazon EMR
Naveen Avalareddy, Sr. Principal Architect, Asurion
November 29, 2016
BDM401
Deep Dive: Amazon EMR
Best Practices & Design Patterns
What to expect from the session
• Overview of Apache ecosystem on EMR
• Using EMR with S3 and other AWS services
• Securing your EMR application stack
• Lowering costs with Auto Scaling and Spot Instances
• Building a data lake with EMR at Asurion
• Q&A
Storage
S3 (EMRFS), HDFS
YARN
Cluster Resource Management
Batch
MapReduce
Interactive
Tez
In Memory
Spark
Applications
Hive, Pig, Spark SQL/Streaming/ML, Mahout, Sqoop
HBase/Phoenix
Presto
Hue (SQL Interface/Metastore Management)
Zeppelin (Interactive Notebook)
Ganglia (Monitoring)
HiveServer2/Spark Thriftserver
(JDBC/ODBC)
Amazon EMR service
Amazon EMR release
Streaming
Flink
YARN overview
• Dynamically share and
centrally configure the same
pool of cluster resources
across engines
• Schedulers for categorizing,
isolating, and prioritizing
workloads
• Spark dynamic allocation of
executors
• Kerberos authentication
YARN schedulers - CapacityScheduler
• Default scheduler specified in Amazon EMR
• Queues
• Single queue is set by default
• Can create additional queues for workloads based on
multitenancy requirements
• Capacity guarantees
• set minimal resources for each queue
• Programmatically assign free resources to queues
• Adjust these settings using the classification capacity-
scheduler in an EMR configuration object
On-cluster UIs
Manage applications
SQL editor, Workflow designer,
Metastore browser
Notebooks
Design and execute
queries and workloads
And more using
bootstrap actions!
Many storage layers to choose from
Amazon DynamoDB
Amazon RDS Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
Amazon Kinesis
Amazon
EMR
Amazon EMR
Amazon Redshift
Elasticsearch
Clickstream
Use RDS/Aurora for an external Hive metastore
Amazon Aurora
Hive Metastore for
external tables on S3
Amazon S3Set metastore
location in hive-site
S3 tips: Partitions, compression, and file formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data set to minimize bandwidth from S3 to
EC2
• Make sure you use splittable compression or have each file
be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased
performance on reads
New – Run HBase on S3
FINRA saved 60% by moving to HBase on EMR
Security - Configuring VPC subnets
• Use Amazon S3 endpoints in VPC for
connectivity to S3
• Use managed NAT for connectivity to
other services or the Internet
• Control the traffic using security groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess
Access control by cluster tag and IAM roles
Tag: user = MyUserIAM user: MyUser
EMR role
EC2 role
SSH key
Fine-grained access control with Apache Ranger
• Plug-ins for Hive, HBase,
YARN, and HDFS
• Configure Hue and Zeppelin
authentication with LDAP/AD
• Row-level authorization for Hive
(with data-masking)
• Full auditing capabilities with
embedded search
• Run Ranger on an edge node –
visit the AWS Big Data Blog
Encryption – Use security configurations
Supported encryption features vary by EMR release
Nasdaq uses Presto on Amazon EMR and
Amazon Redshift as a tiered data lake
Full Presentation: https://www.youtube.com/watch?v=LuHxnOQarXU
New - Auto Scaling
Coming Soon: Advanced Spot provisioning
Master Node Core Instance Fleet Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price
• Spot Block support
21© Asurion. All rights reserved.
Building a data lake at Asurion
• Introduction
• Logical Architecture
• S3EMR - Processing
• Low Latency Query – Data Analysts
• Sandbox – Support Data Scientist
• Auto Scaling – Cost Efficiency
• Security
• Summary / take away points
22© Asurion. All rights reserved.
Asurion’s continuous innovation is helping 290M customers globally stay
connected while driving loyalty to our partners’ brands
• Founded in the mid 1990’s, Asurion has been serving the communications and retail industries for over 20 years
• Based in Nashville, Tennessee, Asurion has over 17,000 associates worldwide
• Serving more then 290 million consumers globally through our operations in 18 countries:
• Asurion is privately-held with annual revenues in excess of $5.8 billion
• Our management team comes from best-in-class companies with experience across mobile, wireline telecom, logistics, insurance, service
contracts, consulting, customer care, marketing, retail and more
• Asurion partners with the worlds leading mobile carriers, retailers cable satellite and cable providers.
North America
• Global Headquarters
• 15 Corporate Owned
Call Centers
• Logistics Center
South America
• 2 Corporate Offices
Europe
• 3 Corporate Offices
• 1 Corporate Owned Call Center
Asia Pacific
• 13 Corporate Offices
• Logistics Center
• 2 Corporate Owned
Call Centers
• Australia
• Brazil
• Canada
• China/Hong-Kong
• Colombia
• England
• France
• Israel
• Japan
• Korea
• Malaysia
• Mexico
• Philippines
• Peru
• Singapore
• Taiwan
• Thailand
• United States
Expanding Global Presence
Corporate Overview
23© Asurion. All rights reserved.
1
3
8
2015 2016 2017
Claims
Telemetry
Web
Telephony
Social
10 Billion interactions
anually
52
million voice
interactions annually
24 million claims
annually
24 million online visitors
annually
A wide variety of
data…
…at a large scale… …that continues to
accelerate
290 million customers
worldwide
Petabytes (>1 million
GB) by EOY
Structured
Unstructured
Growing Data From Multiple Sources…
24© Asurion. All rights reserved.
Data Platform Solution Guiding Principles
• Store all enterprise data & enable user access
• ELT instead of ETL. Transform JIT
• Data quality, an absolute must
• Tackle data security at the foundation level
• Embrace variability, velocity, and volume
• Focus on value, agility, and flexible delivery
• Scale On Demand without manual intervention
• Pay as you go
Platform-as-a-Service based core architecture
25© Asurion. All rights reserved.
Data Platform Solution Logical Architecture
Non-Production Environments Production Conformed
Storage
ConsumptionEnterprise Data Platform - Storage & Processing
Production S3
Storage
Real-time
Data
Processing
Sourcing
Data MartsDataAccessLayer
DataAccessLayer
Legacy
Development
L1 - Raw storage
L3 - Integrated &
conformed
Self Svc
Reports &
Analysis
L2 - Cleansed &
standardized
AccessVirtualization
Orchestration – Jobs, Plans, Workflows – Enterprise Scheduler – Information Lifecycle Management
Data
Collection
Services
Standard
Reports &
Dashboards
Information
Services
Engine
Real-time
Information
Services
Request /
ResponseData Marts
Data Marts
Customer
Hub
L4 – Meta Data
ODBC /
JDBC
Change
Data
Capture
(CDC)
Event
Streaming
Encryption
API
Analytic
Sandbox
QA
OLTP Domain
Applications
Other
Applications
Application
Events
Domain
Applications
Application
Events
Application
Events &
Logging
OLTP Domain
Applications
OLTP Domain
Applications
Other
Applications
Other
Applications
Domain
Applications
Domain
Applications
Production OLTP
systems, events and
other data sources
Environment
Administration Systems
using Metadata
Data storage optimized for consumption--
Amazon Redshift, Dynamo DB, RDS or others
as required
Data Management storage and transformation
-- Amazon EMR using S3 Storage (PaaS)
Development, Business Sandbox and QA
environments for creation of new capabilities
Business Intelligence,
Analytics and Information
Services
Data Extraction, Conversion
& Encryption
Data Virtualization manages user access to all
data controlled by Active Directory and User
Profiles
26© Asurion. All rights reserved.
Logical Architecture – Amazon Web Services
Non-Production Environments Production Conformed
Storage
ConsumptionEnterprise Data Platform - Storage & Processing
Production S3
Storage
Real-time
Data
Processing
Sourcing
Data MartsDataAccessLayer
DataAccessLayer
Legacy
Development
L1 - Raw storage
L3 - Integrated &
conformed
Self Svc
Reports &
Analysis
L2 - Cleansed &
standardized
AccessVirtualization
Orchestration – Jobs, Plans, Workflows – Active Batch – Enterprise Scheduler – Information Lifecycle Management
Data
Collection
Services
Standard
Reports &
Dashboards
Information
Services
Engine
Real-time
Information
Services
Request /
ResponseData Marts
Data Marts
Customer
Hub
L4 – Meta Data
ODBC /
JDBC
Change
Data
Capture
(CDC)
Event
Streaming
Encryption
API
Analytic
Sandbox
QA
OLTP Domain
Applications
Other
Applications
Application
Events
Domain
Applications
Application
Events
Application
Events &
Logging
OLTP Domain
Applications
OLTP Domain
Applications
Other
Applications
Other
Applications
Domain
Applications
Domain
Applications
27© Asurion. All rights reserved.
Using S3 as a Central Data Lake
Non-Production Environments Production Conformed
Storage
ConsumptionEnterprise Data Platform - Storage & Processing
Production S3
Storage
Real-time
Data
Processing
Sourcing
Data MartsDataAccessLayer
DataAccessLayer
Legacy
Development
L1 - Raw storage
L3 - Integrated &
conformed
Self Svc
Reports &
Analysis
L2 - Cleansed &
standardized
AccessVirtualization
Orchestration – Jobs, Plans, Workflows – Active Batch – Enterprise Scheduler – Information Lifecycle Management
Data
Collection
Services
Standard
Reports &
Dashboards
Information
Services
Engine
Real-time
Information
Services
Request /
ResponseData Marts
Data Marts
Customer
Hub
L4 – Meta Data
ODBC /
JDBC
Change
Data
Capture
(CDC)
Event
Streaming
Encryption
API
Analytic
Sandbox
QA
OLTP Domain
Applications
Other
Applications
Application
Events
Domain
Applications
Application
Events
Application
Events &
Logging
OLTP Domain
Applications
OLTP Domain
Applications
Other
Applications
Other
Applications
Domain
Applications
Domain
Applications
28© Asurion. All rights reserved.
Storage & Ingestion – Key Takeaways
Ingestion Transformation
• Handle CR, LF, and delimiter in the data
• Handle time to preserve the time zone
• Handle multi-byte characters – UTF8
• Split & Compress Files 128 MB are more
efficient
• Partition data for performance
• S3 object path is case-sensitive – lowercase
S3
• Compute proximity to storage
• Handle request throttling – partition
data in buckets
• Tagging for cost allocation
Data Transfer to S3
• Python Boto
• Multipart upload
• Storage class (Standard, RRS, IA)
• Lifecycle policies for cost savings
S3 Security
• S3 bucket policies
• IAM, ACL
29© Asurion. All rights reserved.
Data Storage Layers
Layer 1
• Source of truth
• Minimal transformation
• Field Level SPIPII Encryption
• Select a delimiter
• Cleanse data before ingesting for CR, LF, delimiter
• Standardize all times to GMT
Layer 2
• Data Cleansing, Profiling on EMR – Hive
• Partition the data based on query usage – ORC
• Handle deletes and updates – Merge pattern
Layer 3
• Integrate data from multiple systems
• Model based on Data Vault Pattern – EMR
Layer 4
• Conformed, Master & Reference Data
30© Asurion. All rights reserved.
Using EMR for Data Processing
Data Lake
• US
• EU
• JAPAN
Data Lake - Stats
• 50+ Ad hoc Users
• 1000+ Ad hoc Queries  Day
• 20+ Sources of Data
• 100+ ETL Hive Jobs
• 25+ Spark Jobs
• 2+ PB of Data
ADHOC
Transient
ETL Cluster
Hive
Transient
Event Cluster
Spark
RDS
Meta store
Custom Replication
Ideal Path
RDS Replicated
Meta store
To support
Hive  Presto
Data types
Ideal Path
31© Asurion. All rights reserved.
EMR Key Takeaways
EMR
• Use S3 path when creating Hive database
• Use external tables for the data in S3
• Use external metastore
• Recover partitions automatically with MSCK repair
• Alter table to add partitions
• EMRFS - choose between “s3” or “s3n” (not “s3a”)
• Clusters are transient /long running
• Data Pipeline/Python to launch transient cluster
Compression
• Gzip - high compression – not splittable
• Snappy – low compression – splittable
Hive Takeaways
• Partitioning
• De-Normalizing
• Speculative Execution
• File Format – ORC
• hive.vectorized.execution.enabled
• hive.exec.parallel
• hive.hadoop.supports.splittable.combineinputformat
• hive.exec.compress.intermediate
• hive.intermediate.compression.codec
• hive.auto.convert.join
32© Asurion. All rights reserved.
EMR – Presto for Low-Latency BI
Presto
• ORC most interoperable
• Predicate push down
• Low-latency BI ad hoc queries
• R3.2xlarge has yielded better results
Presto – Watch out
• Bucketing is a challenge
• Complex data structures
• Memory limit
• Data type float, char(), varchar()
Presto Settings – Gave optimal results
• query.max-memory
• query.client.timeout
• query.max-age
• query.max-memory-per-node – max. memory on a node for a
query. 42 %
• node-scheduler.max-splits-per-node
• optimizer.columnar-processing
• hive.orc.bloom-filters.enabled
• hive.rcfile-optimized-reader.enabled
• hive.msck.path.validation
33© Asurion. All rights reserved.
EMR - External Hive Metastore with Presto
Metastore
• Best practice is to leverage a single metastore
• Second metastore to handle data type challenges between Hive & Presto
• Modify the table schema to account for data type challenges during sync process
• It’s temporary until Presto supports Hive data types.
• Ex.
• Decimal
• Varchar()
• Char()
• Timestamp – with milliseconds
• Most of the above challenges were resolved in Presto 0.152 (EMR 5.x supports it)
34© Asurion. All rights reserved.
Logical Architecture – Amazon Web Services
Non-Production Environments Production Conformed
Storage
ConsumptionEnterprise Data Platform - Storage & Processing
Production S3
Storage
Real-time
Data
Processing
Sourcing
Data MartsDataAccessLayer
DataAccessLayer
Legacy
Development
L1 - Raw storage
L3 - Integrated &
conformed
Self Svc
Reports &
Analysis
L2 - Cleansed &
standardized
AccessVirtualization
Orchestration – Jobs, Plans, Workflows – Active Batch – Enterprise Scheduler – Information Lifecycle Management
Data
Collection
Services
Standard
Reports &
Dashboards
Information
Services
Engine
Real-time
Information
Services
Request /
ResponseData Marts
Data Marts
Customer
Hub
L4 – Metadata
ODBC /
JDBC
Change
Data
Capture
(CDC)
Event
Streaming
Encryption
API
Analytic
Sandbox
QA
OLTP Domain
Applications
Other
Applications
Application
Events
Domain
Applications
Application
Events
Application
Events &
Logging
OLTP Domain
Applications
OLTP Domain
Applications
Other
Applications
Other
Applications
Domain
Applications
Domain
Applications
35© Asurion. All rights reserved.
EMR Sandbox
Sandbox
• Enables data lake access to users
• Enables dedicated compute capacity –
instead of working with YARN queues
• Enables cost savings with help of EMR
Scaling & Spot
Sandbox
Master
Meta store
Custom Replication
Production Schemas Only
RDS Replicated
Meta store
To support
custom
schemas for
users
Read Only
Sandbox1
ADHOC
PRODUCTION Sandbox-n
ReadWrite
Sandbox S3
Storage
Production
S3 Storage
36© Asurion. All rights reserved.
EMR Sandbox
EMR Sandbox
• Separate metastore for production and sandbox
• Single metastore for multiple sandboxes
• Sync production schema DDL with sandbox
• Assign new schema for each sandbox
EMR Sandbox Security
• Read-only access to Production S3
• Read/write access to Sandbox S3
• Specific sandbox storage with access control
• Controlled via IAM roles & policies
aws emr create-cluster --applications Name=Pig Name=HIVE Name=SPARK Name=ZEPPELIN
Name=GANGLIA --tags 'PLATFORM=ANALYTICS' 'ENVIRONMENT=SB' 'Name=sandbox1' --ec2-attributes
'{"KeyName":“sandbox1","InstanceProfile":"EMR_EC2_Sandbox1Role“}
EMR creation with IAM roles
37© Asurion. All rights reserved.
EMR Sandbox IAM Policies
{
"Effect": "Deny",
"Action": [
"s3:DeleteObject“, “s3:Put*”
],
"Resource": [
"arn:aws:s3:::productions3001/*"]
},
{
"Effect": "Allow",
"Action": [
"s3:Put*“, "s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::reinventsandbox1/sandbo
x1/*"]
}
Policy to allow sandbox read  writePolicy to protect production data
38© Asurion. All rights reserved.
EMR Auto Scaling
Automatic resizing based on
CloudWatch metrics
• Is Cluster Idle
• Containers pending
• Apps Pending
• AVG CPU Usage –
Custom Metric
Define minimum  maximum
instance count
Define when scaling should
occur
39© Asurion. All rights reserved.
EMR Auto Scaling – Ganglia
Based on the CPU usage of EMR cluster, scaling function adds and removes nodes accordingly
CPU metrics is captured from Ganglia using Lambda
40© Asurion. All rights reserved.
EMR Auto Scaling – Custom Metrics
Calculate CPU Usage
Object.keys(datapointSlice).forEach(function(datapointSliceKey){
//console.log('datapointKeyValue',datapointSlice[datapointSliceKey][0]);
if(datapointSlice[datapointSliceKey][0] != 'NaN'){
TotalCPUUsage = TotalCPUUsage + (datapointSlice[datapointSliceKey][0] - 0);
loopCount = loopCount + 1;
}
});//End of for eachfor datapointSlice
AvgCpuUsage = (TotalCPUUsage/loopCount);
Calculate Last 1 Minute Average
datapointCount = Object.keys(ccpu_user_datapoints).length;
datapointSlice = ccpu_user_datapoints.slice(Math.max(datapointCount - 6,1))
Sample Ganglia JSON
http://clusterip/ganglia/graph.php?r=hour&?c=clusterid&m=load_one&s=by+name&mc=2&g=cpu_report&json=1
[{"ds_name":"ccpu_user","cluster_name":"","graph_type":"stack","host_name":"","metric_name":"Userg","col
or":"#3333bb","datapoints":[[2.2783333333,1479435030],[2.0216666667,1479435045],[2.465,1479435060]}
41© Asurion. All rights reserved.
EMR Auto Scaling – Custom Metrics
var CloudWatchMetricparams = {
MetricData: [
{
MetricName: 'CPUAvgUsageLastOneMin',
Timestamp: new Date().toISOString(),
Unit: 'Percent',
Value: AvgCpuUsage,
Dimensions :[
{ Name :'JobFlowId',Value : emrClusterID}
]
}
],
Namespace: 'EMR-CPUUsage'
}; //End of Paramet definition for Alarm
cloudwatch.putMetricData(CloudWatchMetricparams, function(err, data) {
if (err) console.log(err, err.stack);
else console.log(data);
});
CloudWatch alarm
42© Asurion. All rights reserved.
EMR Auto Scaling – Cost Savings
(55% savings when compared with On Demand)
Usage type Count
Blended
cost
Usage
quantity hrs. Unit price
On Demand
price
BoxUsage:d2.2xlarge 4 0.36 200 1.38 276
BoxUsage:r3.2xlarge 1 26.83 50 0.665 33.25
BoxUsage:r3.4xlarge 69 272.65 205 1.33 272.65
SpotUsage:m3.xlarge 30 13.44 300 0.266 79.8
SpotUsage:r3.4xlarge 203 277.67 809 1.33 1075.97
590.97 1564 1737.67
55% cost savings when compared with On Demand
Note:
 Only EC2 cost is depicted here – EMR cost is not added
 BoxUsage:d2.2xlarge – Reserved Instance
 Without Reserved Instance, approx. 40% cost savings
43© Asurion. All rights reserved.
Logical Architecture – Data Vitualization
Non-Production Environments Production Conformed
Storage
ConsumptionEnterprise Data Platform - Storage & Processing
Production S3
Storage
Real-time
Data
Processin
g
Sourcing
Data MartsDataAccessLayer
DataAccessLayer
Legacy
Development
L1 - Raw storage
L3 - Integrated &
conformed
Self Svc
Reports &
Analysis
L2 - Cleansed &
standardized
AccessVirtualization
Orchestration – Jobs, Plans, Workflows – Active Batch – Enterprise Scheduler – Information Lifecycle Management
Data
Collection
Services
Standard
Reports &
Dashboard
s
Information
Services
Engine
Real-time
Information
Services
Request /
ResponseData Marts
Data Marts
Customer
Hub
L4 – Meta Data
ODBC /
JDBC
Change
Data
Capture
(CDC)
Event
Streaming
Encryptio
n
API
Analytic
Sandbox
QA
OLTP Domain
Applications
Other
Applications
Application
Events
Domain
Applications
Application
Events
Application
Events &
Logging
OLTP Domain
Applications
OLTP Domain
Applications
Other
Applications
Other
Applications
Domain
Applications
Domain
Applications
44© Asurion. All rights reserved.
Data Lake Security
corporate data center
Data
Virtualization
Data
Virtualization
ldap
Presto
Data flow
Redshift
ldap
Encryption
Engine
ELT
45© Asurion. All rights reserved.
Data Lake Security
• Data virtualization to handle data lake security
• ANSI SQL compliant and native pushdown enabled
• Enabled column and row level security
• Users are authenticated by on-premises existing LDAP
• Users are authorized based on the roles defined in data virtualization
• Ad hoc queries & reports are through JDBC & ODBC against Amazon Redshift &
Presto
46© Asurion. All rights reserved.
What have we learned?
• Manage cost – adjustments may be required with scale
• Real-time frictionless scaling – automate where possible
• Align to core design patterns
• Leverage readily available solutions
• Design for security and compliance at the start
• Fail forward and adjust as needed
• Harden solution for one market – Integrate regional variance and deploy
globally
Thank you!
Remember to complete
your evaluations!
Additional Slides
50© Asurion. All rights reserved.
Asurion – Life’s Operating System®
Our symbol is inspired by a fingerprint, the intersection of life and technology
Support
Security & PrivacyMobile Insurance Replace & Repair In-Home Service
Extended
Warranties
51© Asurion. All rights reserved.
Data & Analytics at Asurion – Our Mission
Automation &
Faster
Execution
Efficient Data
Analytics
Platform
Smarter
Business
Decisions
Innovative
Data Products
Enable business decisions
with advanced analytics and
new actionable insights
Develop differentiated
capabilities and improved
services
Make quality data accessible,
discover business insights,
deliver analytical solutions
Acquire, process, and mine
structured & unstructured
data assets
52© Asurion. All rights reserved.
EMR Auto Scaling – Spot Price
 Above graph is for Spot Instance pricing history for
r3.4xlarge; during certain time periods, it’s greater than
$12.00
 On-Demand Price for the same instance is $1.33  hr.
 Check the Spot price against the On-Demand price before
placing the bid during Auto Scaling
 Upcoming instance fleets feature will address pulling from
multiple Spot Instance types/bid prices
53© Asurion. All rights reserved.
How does Asurion Enable Business
Secure near real-time accessible data
Platform for data discovery – explore & fail fast
Deploy models and algorithms
Run analytics at scale
54© Asurion. All rights reserved.
Store and Ingestion
Data with real-time and batch needs, that is required to support reporting, analytics, data
science, data mining, predictive modeling, machine learning, data discovery, etc… will be stored
in the data lake.
Storage Requirements
• Ability to handle multiple file types
• Scalable storage
• Tiered storage
• High durability
• High availability
• Store system of record
• Low cost
• Secure storage
• Encrypted at rest
• Rotate encryption periodically
• Segregate data by business unit
Batch Ingestion Requirements
• High throughput
• Guaranteed delivery
• Parallel execution
• High availability
• Track dataflow
• Track lineage
• Encrypt sensitive data
• Metadata driven
• Support multiple systems
• Support incremental processing
55© Asurion. All rights reserved.
EMR - Create Hive Table and Recover Partitions
CREATE EXTERNAL TABLE example ()
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyT
extOutputFormat'
LOCATION location
's3n://reinvent/2016/example'
TBLPROPERTIES(
'serialization.null.format'='',
'skip.header.line.count'='1‘
)
MSCK REPAIR TABLE example;
ALTER TABLE ADD PARTITION (dt='2016-11-17')
location 's3n://reinvent/2016/example/dt=2016-
11-17'
56© Asurion. All rights reserved.
EMR - External Hive Table on DynamoDB
CREATE EXTERNAL TABLE IF NOT EXISTS
l0.reinvent2016 ( properties
string, MD5 string)
STORED BY
'org.apache.hadoop.hive.dynamodb.DynamoD
BStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = “dynamo-
reinvent2016
"dynamodb.column.mapping" =
"properties:SNSMessage,md5:md5"
);
SET mapred.job.queue.name=default;
set hive.exec.parallel=true;
set hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
SET dynamodb.throughput.read.percent=.9;
Insert Overwrite table l1.reinvent2016
partition(eventtype,eventdate)
Select
Properties,
get_json_object(reinvent2016.properties,'$.event'),
md5,
<<#Lineage_Key#>>,
get_json_object(reinvent2016.properties,'$.propertie
s.EventType'),
To_date(get_json_object(reinvent2016.properties,'$.p
roperties.EventDateTime'))
From l0. reinvent2016;
57© Asurion. All rights reserved.
EMR - Hive Table – ORC
create external table if not exists
example ()
partitioned by (businessdate string)
Row format serde
org.apache.hadoop.hive.ql.io.orc.OrcSerd
e
location
's3n://reinvent/2016/example'
tblproperties (
'Orc.Create.Index'='true',
'Orc.Row.Index.Stride'='10000',
'orc.compress'='ZLIB',
'orc.stripe.size'='268435456',
)
insert overwrite into table example
partition(businessdate)
select col_a, col_b, businessdate
from l1.table1
group by businessdate ;
58© Asurion. All rights reserved.
EMR Auto Scaling Events
EMR CloudWatch metrics
• Is Cluster Idle
• Containers Pending
• Apps Pending
• YARN Memory Allocated Percentage
EMR custom metrics
• Ganglia bundled with EMR provides
scalable distributed monitoring capability
• Custom metrics in CloudWatch for AVG
CPU usage from Ganglia
{ "MaxNode": "25", "MinNode": "1",
"EC2AvailabilityZone": "us-east-1c",
"NoOfNodeToAdd": "2", "NoOfNodeToRemove": "3",
"newIGNodeCount": "2",
"InstanceType": "r3.4xlarge",
"emrEC2ProductDescriptions": "Linux/UNIX",
"maxTaskIG": "30",
"BidPriceHikePercentage": ".15",
"ScaleUpAlarmName":“scaleup_alarm",
"ScaleDownAlarmName":“scaledown_alarm",
"IGBidPriceDifThreshold":"0.20",
"OnDemandEC2Location": "US East (N. Virginia)",
"OnDemandpreInstalledSw": "NA",
"OnDemandEC2OS": "Linux",
"OnDemandEC2tenancy": "Shared" }

More Related Content

What's hot

(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum Efficiency
Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum EfficiencyDeploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum Efficiency
Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum EfficiencyAmazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best PracticesAmazon Web Services
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Web Services
 
How to Scale to Millions of Users with AWS
How to Scale to Millions of Users with AWSHow to Scale to Millions of Users with AWS
How to Scale to Millions of Users with AWSAmazon Web Services
 
Getting started with amazon aurora - Toronto
Getting started with amazon aurora - TorontoGetting started with amazon aurora - Toronto
Getting started with amazon aurora - TorontoAmazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)Amazon Web Services
 
Amazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
Amazon RDS: Deep Dive - SRV310 - Chicago AWS SummitAmazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
Amazon RDS: Deep Dive - SRV310 - Chicago AWS SummitAmazon Web Services
 
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioBursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioAlluxio, Inc.
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Database Migration – Simple, Cross-Engine and Cross-Platform Migration
Database Migration – Simple, Cross-Engine and Cross-Platform MigrationDatabase Migration – Simple, Cross-Engine and Cross-Platform Migration
Database Migration – Simple, Cross-Engine and Cross-Platform MigrationAmazon Web Services
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big DataAmazon Web Services
 
Introduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceIntroduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceAmazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
 
Introduction to Storage on AWS - AWS Summit Cape Town 2017
Introduction to Storage on AWS - AWS Summit Cape Town 2017Introduction to Storage on AWS - AWS Summit Cape Town 2017
Introduction to Storage on AWS - AWS Summit Cape Town 2017Amazon Web Services
 
How to Migrate your Startup to AWS
How to Migrate your Startup to AWSHow to Migrate your Startup to AWS
How to Migrate your Startup to AWSAmazon Web Services
 

What's hot (20)

(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum Efficiency
Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum EfficiencyDeploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum Efficiency
Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum Efficiency
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
 
How to Scale to Millions of Users with AWS
How to Scale to Millions of Users with AWSHow to Scale to Millions of Users with AWS
How to Scale to Millions of Users with AWS
 
Getting started with amazon aurora - Toronto
Getting started with amazon aurora - TorontoGetting started with amazon aurora - Toronto
Getting started with amazon aurora - Toronto
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
 
Amazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
Amazon RDS: Deep Dive - SRV310 - Chicago AWS SummitAmazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
Amazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
 
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioBursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Database Migration – Simple, Cross-Engine and Cross-Platform Migration
Database Migration – Simple, Cross-Engine and Cross-Platform MigrationDatabase Migration – Simple, Cross-Engine and Cross-Platform Migration
Database Migration – Simple, Cross-Engine and Cross-Platform Migration
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
Introduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceIntroduction to Amazon Relational Database Service
Introduction to Amazon Relational Database Service
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
Introduction to Storage on AWS - AWS Summit Cape Town 2017
Introduction to Storage on AWS - AWS Summit Cape Town 2017Introduction to Storage on AWS - AWS Summit Cape Town 2017
Introduction to Storage on AWS - AWS Summit Cape Town 2017
 
Beyond EC2 and S3
Beyond EC2 and S3Beyond EC2 and S3
Beyond EC2 and S3
 
How to Migrate your Startup to AWS
How to Migrate your Startup to AWSHow to Migrate your Startup to AWS
How to Migrate your Startup to AWS
 

Viewers also liked

Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Serkan Özal
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Sigmoid
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksAnyscale
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Databricks
 

Viewers also liked (6)

Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at Databricks
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 

Similar to AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (BDM401)

¿Quién es Amazon Web Services?
¿Quién es Amazon Web Services?¿Quién es Amazon Web Services?
¿Quién es Amazon Web Services?Software Guru
 
Module 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSModule 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSLam Le
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Amazon Web Services
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Amazon Web Services
 
Aws re invent 2018 recap
Aws re invent 2018 recapAws re invent 2018 recap
Aws re invent 2018 recapCloudHesive
 
Introduction to Amazon Relational Database Service (Amazon RDS)
Introduction to Amazon Relational Database Service (Amazon RDS)Introduction to Amazon Relational Database Service (Amazon RDS)
Introduction to Amazon Relational Database Service (Amazon RDS)Amazon Web Services
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Amazon Web Services
 
Introduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceIntroduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceAmazon Web Services
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudAmazon Web Services
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...AWS Riyadh User Group
 
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...Amazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...Amazon Web Services
 
Databases - EBC on the road Brazil Edition [Portuguese]
Databases - EBC on the road Brazil Edition [Portuguese]Databases - EBC on the road Brazil Edition [Portuguese]
Databases - EBC on the road Brazil Edition [Portuguese]Amazon Web Services
 

Similar to AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (BDM401) (20)

Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
¿Quién es Amazon Web Services?
¿Quién es Amazon Web Services?¿Quién es Amazon Web Services?
¿Quién es Amazon Web Services?
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Module 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSModule 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWS
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301
 
Aws re invent 2018 recap
Aws re invent 2018 recapAws re invent 2018 recap
Aws re invent 2018 recap
 
Introduction to Amazon Relational Database Service (Amazon RDS)
Introduction to Amazon Relational Database Service (Amazon RDS)Introduction to Amazon Relational Database Service (Amazon RDS)
Introduction to Amazon Relational Database Service (Amazon RDS)
 
Aws meetup 20190427
Aws meetup 20190427Aws meetup 20190427
Aws meetup 20190427
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
 
Introduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceIntroduction to Amazon Relational Database Service
Introduction to Amazon Relational Database Service
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
 
AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
 
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...
 
Databases - EBC on the road Brazil Edition [Portuguese]
Databases - EBC on the road Brazil Edition [Portuguese]Databases - EBC on the road Brazil Edition [Portuguese]
Databases - EBC on the road Brazil Edition [Portuguese]
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Recently uploaded (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (BDM401)

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jonathan Fritz, Sr. Product Manager, Amazon EMR Naveen Avalareddy, Sr. Principal Architect, Asurion November 29, 2016 BDM401 Deep Dive: Amazon EMR Best Practices & Design Patterns
  • 2. What to expect from the session • Overview of Apache ecosystem on EMR • Using EMR with S3 and other AWS services • Securing your EMR application stack • Lowering costs with Auto Scaling and Spot Instances • Building a data lake with EMR at Asurion • Q&A
  • 3.
  • 4. Storage S3 (EMRFS), HDFS YARN Cluster Resource Management Batch MapReduce Interactive Tez In Memory Spark Applications Hive, Pig, Spark SQL/Streaming/ML, Mahout, Sqoop HBase/Phoenix Presto Hue (SQL Interface/Metastore Management) Zeppelin (Interactive Notebook) Ganglia (Monitoring) HiveServer2/Spark Thriftserver (JDBC/ODBC) Amazon EMR service Amazon EMR release Streaming Flink
  • 5. YARN overview • Dynamically share and centrally configure the same pool of cluster resources across engines • Schedulers for categorizing, isolating, and prioritizing workloads • Spark dynamic allocation of executors • Kerberos authentication
  • 6. YARN schedulers - CapacityScheduler • Default scheduler specified in Amazon EMR • Queues • Single queue is set by default • Can create additional queues for workloads based on multitenancy requirements • Capacity guarantees • set minimal resources for each queue • Programmatically assign free resources to queues • Adjust these settings using the classification capacity- scheduler in an EMR configuration object
  • 7. On-cluster UIs Manage applications SQL editor, Workflow designer, Metastore browser Notebooks Design and execute queries and workloads And more using bootstrap actions!
  • 8. Many storage layers to choose from Amazon DynamoDB Amazon RDS Amazon Kinesis Amazon Redshift Amazon S3 Amazon EMR
  • 9. Amazon Kinesis Amazon EMR Amazon EMR Amazon Redshift Elasticsearch Clickstream
  • 10. Use RDS/Aurora for an external Hive metastore Amazon Aurora Hive Metastore for external tables on S3 Amazon S3Set metastore location in hive-site
  • 11. S3 tips: Partitions, compression, and file formats • Avoid key names in lexicographical order • Improve throughput and S3 list performance • Use hashing/random prefixes or reverse the date-time • Compress data set to minimize bandwidth from S3 to EC2 • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased performance on reads
  • 12. New – Run HBase on S3
  • 13. FINRA saved 60% by moving to HBase on EMR
  • 14. Security - Configuring VPC subnets • Use Amazon S3 endpoints in VPC for connectivity to S3 • Use managed NAT for connectivity to other services or the Internet • Control the traffic using security groups • ElasticMapReduce-Master-Private • ElasticMapReduce-Slave-Private • ElasticMapReduce-ServiceAccess
  • 15. Access control by cluster tag and IAM roles Tag: user = MyUserIAM user: MyUser EMR role EC2 role SSH key
  • 16. Fine-grained access control with Apache Ranger • Plug-ins for Hive, HBase, YARN, and HDFS • Configure Hue and Zeppelin authentication with LDAP/AD • Row-level authorization for Hive (with data-masking) • Full auditing capabilities with embedded search • Run Ranger on an edge node – visit the AWS Big Data Blog
  • 17. Encryption – Use security configurations Supported encryption features vary by EMR release
  • 18. Nasdaq uses Presto on Amazon EMR and Amazon Redshift as a tiered data lake Full Presentation: https://www.youtube.com/watch?v=LuHxnOQarXU
  • 19. New - Auto Scaling
  • 20. Coming Soon: Advanced Spot provisioning Master Node Core Instance Fleet Task Instance Fleet • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal Availability Zone based on capacity/price • Spot Block support
  • 21. 21© Asurion. All rights reserved. Building a data lake at Asurion • Introduction • Logical Architecture • S3EMR - Processing • Low Latency Query – Data Analysts • Sandbox – Support Data Scientist • Auto Scaling – Cost Efficiency • Security • Summary / take away points
  • 22. 22© Asurion. All rights reserved. Asurion’s continuous innovation is helping 290M customers globally stay connected while driving loyalty to our partners’ brands • Founded in the mid 1990’s, Asurion has been serving the communications and retail industries for over 20 years • Based in Nashville, Tennessee, Asurion has over 17,000 associates worldwide • Serving more then 290 million consumers globally through our operations in 18 countries: • Asurion is privately-held with annual revenues in excess of $5.8 billion • Our management team comes from best-in-class companies with experience across mobile, wireline telecom, logistics, insurance, service contracts, consulting, customer care, marketing, retail and more • Asurion partners with the worlds leading mobile carriers, retailers cable satellite and cable providers. North America • Global Headquarters • 15 Corporate Owned Call Centers • Logistics Center South America • 2 Corporate Offices Europe • 3 Corporate Offices • 1 Corporate Owned Call Center Asia Pacific • 13 Corporate Offices • Logistics Center • 2 Corporate Owned Call Centers • Australia • Brazil • Canada • China/Hong-Kong • Colombia • England • France • Israel • Japan • Korea • Malaysia • Mexico • Philippines • Peru • Singapore • Taiwan • Thailand • United States Expanding Global Presence Corporate Overview
  • 23. 23© Asurion. All rights reserved. 1 3 8 2015 2016 2017 Claims Telemetry Web Telephony Social 10 Billion interactions anually 52 million voice interactions annually 24 million claims annually 24 million online visitors annually A wide variety of data… …at a large scale… …that continues to accelerate 290 million customers worldwide Petabytes (>1 million GB) by EOY Structured Unstructured Growing Data From Multiple Sources…
  • 24. 24© Asurion. All rights reserved. Data Platform Solution Guiding Principles • Store all enterprise data & enable user access • ELT instead of ETL. Transform JIT • Data quality, an absolute must • Tackle data security at the foundation level • Embrace variability, velocity, and volume • Focus on value, agility, and flexible delivery • Scale On Demand without manual intervention • Pay as you go Platform-as-a-Service based core architecture
  • 25. 25© Asurion. All rights reserved. Data Platform Solution Logical Architecture Non-Production Environments Production Conformed Storage ConsumptionEnterprise Data Platform - Storage & Processing Production S3 Storage Real-time Data Processing Sourcing Data MartsDataAccessLayer DataAccessLayer Legacy Development L1 - Raw storage L3 - Integrated & conformed Self Svc Reports & Analysis L2 - Cleansed & standardized AccessVirtualization Orchestration – Jobs, Plans, Workflows – Enterprise Scheduler – Information Lifecycle Management Data Collection Services Standard Reports & Dashboards Information Services Engine Real-time Information Services Request / ResponseData Marts Data Marts Customer Hub L4 – Meta Data ODBC / JDBC Change Data Capture (CDC) Event Streaming Encryption API Analytic Sandbox QA OLTP Domain Applications Other Applications Application Events Domain Applications Application Events Application Events & Logging OLTP Domain Applications OLTP Domain Applications Other Applications Other Applications Domain Applications Domain Applications Production OLTP systems, events and other data sources Environment Administration Systems using Metadata Data storage optimized for consumption-- Amazon Redshift, Dynamo DB, RDS or others as required Data Management storage and transformation -- Amazon EMR using S3 Storage (PaaS) Development, Business Sandbox and QA environments for creation of new capabilities Business Intelligence, Analytics and Information Services Data Extraction, Conversion & Encryption Data Virtualization manages user access to all data controlled by Active Directory and User Profiles
  • 26. 26© Asurion. All rights reserved. Logical Architecture – Amazon Web Services Non-Production Environments Production Conformed Storage ConsumptionEnterprise Data Platform - Storage & Processing Production S3 Storage Real-time Data Processing Sourcing Data MartsDataAccessLayer DataAccessLayer Legacy Development L1 - Raw storage L3 - Integrated & conformed Self Svc Reports & Analysis L2 - Cleansed & standardized AccessVirtualization Orchestration – Jobs, Plans, Workflows – Active Batch – Enterprise Scheduler – Information Lifecycle Management Data Collection Services Standard Reports & Dashboards Information Services Engine Real-time Information Services Request / ResponseData Marts Data Marts Customer Hub L4 – Meta Data ODBC / JDBC Change Data Capture (CDC) Event Streaming Encryption API Analytic Sandbox QA OLTP Domain Applications Other Applications Application Events Domain Applications Application Events Application Events & Logging OLTP Domain Applications OLTP Domain Applications Other Applications Other Applications Domain Applications Domain Applications
  • 27. 27© Asurion. All rights reserved. Using S3 as a Central Data Lake Non-Production Environments Production Conformed Storage ConsumptionEnterprise Data Platform - Storage & Processing Production S3 Storage Real-time Data Processing Sourcing Data MartsDataAccessLayer DataAccessLayer Legacy Development L1 - Raw storage L3 - Integrated & conformed Self Svc Reports & Analysis L2 - Cleansed & standardized AccessVirtualization Orchestration – Jobs, Plans, Workflows – Active Batch – Enterprise Scheduler – Information Lifecycle Management Data Collection Services Standard Reports & Dashboards Information Services Engine Real-time Information Services Request / ResponseData Marts Data Marts Customer Hub L4 – Meta Data ODBC / JDBC Change Data Capture (CDC) Event Streaming Encryption API Analytic Sandbox QA OLTP Domain Applications Other Applications Application Events Domain Applications Application Events Application Events & Logging OLTP Domain Applications OLTP Domain Applications Other Applications Other Applications Domain Applications Domain Applications
  • 28. 28© Asurion. All rights reserved. Storage & Ingestion – Key Takeaways Ingestion Transformation • Handle CR, LF, and delimiter in the data • Handle time to preserve the time zone • Handle multi-byte characters – UTF8 • Split & Compress Files 128 MB are more efficient • Partition data for performance • S3 object path is case-sensitive – lowercase S3 • Compute proximity to storage • Handle request throttling – partition data in buckets • Tagging for cost allocation Data Transfer to S3 • Python Boto • Multipart upload • Storage class (Standard, RRS, IA) • Lifecycle policies for cost savings S3 Security • S3 bucket policies • IAM, ACL
  • 29. 29© Asurion. All rights reserved. Data Storage Layers Layer 1 • Source of truth • Minimal transformation • Field Level SPIPII Encryption • Select a delimiter • Cleanse data before ingesting for CR, LF, delimiter • Standardize all times to GMT Layer 2 • Data Cleansing, Profiling on EMR – Hive • Partition the data based on query usage – ORC • Handle deletes and updates – Merge pattern Layer 3 • Integrate data from multiple systems • Model based on Data Vault Pattern – EMR Layer 4 • Conformed, Master & Reference Data
  • 30. 30© Asurion. All rights reserved. Using EMR for Data Processing Data Lake • US • EU • JAPAN Data Lake - Stats • 50+ Ad hoc Users • 1000+ Ad hoc Queries Day • 20+ Sources of Data • 100+ ETL Hive Jobs • 25+ Spark Jobs • 2+ PB of Data ADHOC Transient ETL Cluster Hive Transient Event Cluster Spark RDS Meta store Custom Replication Ideal Path RDS Replicated Meta store To support Hive Presto Data types Ideal Path
  • 31. 31© Asurion. All rights reserved. EMR Key Takeaways EMR • Use S3 path when creating Hive database • Use external tables for the data in S3 • Use external metastore • Recover partitions automatically with MSCK repair • Alter table to add partitions • EMRFS - choose between “s3” or “s3n” (not “s3a”) • Clusters are transient /long running • Data Pipeline/Python to launch transient cluster Compression • Gzip - high compression – not splittable • Snappy – low compression – splittable Hive Takeaways • Partitioning • De-Normalizing • Speculative Execution • File Format – ORC • hive.vectorized.execution.enabled • hive.exec.parallel • hive.hadoop.supports.splittable.combineinputformat • hive.exec.compress.intermediate • hive.intermediate.compression.codec • hive.auto.convert.join
  • 32. 32© Asurion. All rights reserved. EMR – Presto for Low-Latency BI Presto • ORC most interoperable • Predicate push down • Low-latency BI ad hoc queries • R3.2xlarge has yielded better results Presto – Watch out • Bucketing is a challenge • Complex data structures • Memory limit • Data type float, char(), varchar() Presto Settings – Gave optimal results • query.max-memory • query.client.timeout • query.max-age • query.max-memory-per-node – max. memory on a node for a query. 42 % • node-scheduler.max-splits-per-node • optimizer.columnar-processing • hive.orc.bloom-filters.enabled • hive.rcfile-optimized-reader.enabled • hive.msck.path.validation
  • 33. 33© Asurion. All rights reserved. EMR - External Hive Metastore with Presto Metastore • Best practice is to leverage a single metastore • Second metastore to handle data type challenges between Hive & Presto • Modify the table schema to account for data type challenges during sync process • It’s temporary until Presto supports Hive data types. • Ex. • Decimal • Varchar() • Char() • Timestamp – with milliseconds • Most of the above challenges were resolved in Presto 0.152 (EMR 5.x supports it)
  • 34. 34© Asurion. All rights reserved. Logical Architecture – Amazon Web Services Non-Production Environments Production Conformed Storage ConsumptionEnterprise Data Platform - Storage & Processing Production S3 Storage Real-time Data Processing Sourcing Data MartsDataAccessLayer DataAccessLayer Legacy Development L1 - Raw storage L3 - Integrated & conformed Self Svc Reports & Analysis L2 - Cleansed & standardized AccessVirtualization Orchestration – Jobs, Plans, Workflows – Active Batch – Enterprise Scheduler – Information Lifecycle Management Data Collection Services Standard Reports & Dashboards Information Services Engine Real-time Information Services Request / ResponseData Marts Data Marts Customer Hub L4 – Metadata ODBC / JDBC Change Data Capture (CDC) Event Streaming Encryption API Analytic Sandbox QA OLTP Domain Applications Other Applications Application Events Domain Applications Application Events Application Events & Logging OLTP Domain Applications OLTP Domain Applications Other Applications Other Applications Domain Applications Domain Applications
  • 35. 35© Asurion. All rights reserved. EMR Sandbox Sandbox • Enables data lake access to users • Enables dedicated compute capacity – instead of working with YARN queues • Enables cost savings with help of EMR Scaling & Spot Sandbox Master Meta store Custom Replication Production Schemas Only RDS Replicated Meta store To support custom schemas for users Read Only Sandbox1 ADHOC PRODUCTION Sandbox-n ReadWrite Sandbox S3 Storage Production S3 Storage
  • 36. 36© Asurion. All rights reserved. EMR Sandbox EMR Sandbox • Separate metastore for production and sandbox • Single metastore for multiple sandboxes • Sync production schema DDL with sandbox • Assign new schema for each sandbox EMR Sandbox Security • Read-only access to Production S3 • Read/write access to Sandbox S3 • Specific sandbox storage with access control • Controlled via IAM roles & policies aws emr create-cluster --applications Name=Pig Name=HIVE Name=SPARK Name=ZEPPELIN Name=GANGLIA --tags 'PLATFORM=ANALYTICS' 'ENVIRONMENT=SB' 'Name=sandbox1' --ec2-attributes '{"KeyName":“sandbox1","InstanceProfile":"EMR_EC2_Sandbox1Role“} EMR creation with IAM roles
  • 37. 37© Asurion. All rights reserved. EMR Sandbox IAM Policies { "Effect": "Deny", "Action": [ "s3:DeleteObject“, “s3:Put*” ], "Resource": [ "arn:aws:s3:::productions3001/*"] }, { "Effect": "Allow", "Action": [ "s3:Put*“, "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::reinventsandbox1/sandbo x1/*"] } Policy to allow sandbox read writePolicy to protect production data
  • 38. 38© Asurion. All rights reserved. EMR Auto Scaling Automatic resizing based on CloudWatch metrics • Is Cluster Idle • Containers pending • Apps Pending • AVG CPU Usage – Custom Metric Define minimum maximum instance count Define when scaling should occur
  • 39. 39© Asurion. All rights reserved. EMR Auto Scaling – Ganglia Based on the CPU usage of EMR cluster, scaling function adds and removes nodes accordingly CPU metrics is captured from Ganglia using Lambda
  • 40. 40© Asurion. All rights reserved. EMR Auto Scaling – Custom Metrics Calculate CPU Usage Object.keys(datapointSlice).forEach(function(datapointSliceKey){ //console.log('datapointKeyValue',datapointSlice[datapointSliceKey][0]); if(datapointSlice[datapointSliceKey][0] != 'NaN'){ TotalCPUUsage = TotalCPUUsage + (datapointSlice[datapointSliceKey][0] - 0); loopCount = loopCount + 1; } });//End of for eachfor datapointSlice AvgCpuUsage = (TotalCPUUsage/loopCount); Calculate Last 1 Minute Average datapointCount = Object.keys(ccpu_user_datapoints).length; datapointSlice = ccpu_user_datapoints.slice(Math.max(datapointCount - 6,1)) Sample Ganglia JSON http://clusterip/ganglia/graph.php?r=hour&?c=clusterid&m=load_one&s=by+name&mc=2&g=cpu_report&json=1 [{"ds_name":"ccpu_user","cluster_name":"","graph_type":"stack","host_name":"","metric_name":"Userg","col or":"#3333bb","datapoints":[[2.2783333333,1479435030],[2.0216666667,1479435045],[2.465,1479435060]}
  • 41. 41© Asurion. All rights reserved. EMR Auto Scaling – Custom Metrics var CloudWatchMetricparams = { MetricData: [ { MetricName: 'CPUAvgUsageLastOneMin', Timestamp: new Date().toISOString(), Unit: 'Percent', Value: AvgCpuUsage, Dimensions :[ { Name :'JobFlowId',Value : emrClusterID} ] } ], Namespace: 'EMR-CPUUsage' }; //End of Paramet definition for Alarm cloudwatch.putMetricData(CloudWatchMetricparams, function(err, data) { if (err) console.log(err, err.stack); else console.log(data); }); CloudWatch alarm
  • 42. 42© Asurion. All rights reserved. EMR Auto Scaling – Cost Savings (55% savings when compared with On Demand) Usage type Count Blended cost Usage quantity hrs. Unit price On Demand price BoxUsage:d2.2xlarge 4 0.36 200 1.38 276 BoxUsage:r3.2xlarge 1 26.83 50 0.665 33.25 BoxUsage:r3.4xlarge 69 272.65 205 1.33 272.65 SpotUsage:m3.xlarge 30 13.44 300 0.266 79.8 SpotUsage:r3.4xlarge 203 277.67 809 1.33 1075.97 590.97 1564 1737.67 55% cost savings when compared with On Demand Note:  Only EC2 cost is depicted here – EMR cost is not added  BoxUsage:d2.2xlarge – Reserved Instance  Without Reserved Instance, approx. 40% cost savings
  • 43. 43© Asurion. All rights reserved. Logical Architecture – Data Vitualization Non-Production Environments Production Conformed Storage ConsumptionEnterprise Data Platform - Storage & Processing Production S3 Storage Real-time Data Processin g Sourcing Data MartsDataAccessLayer DataAccessLayer Legacy Development L1 - Raw storage L3 - Integrated & conformed Self Svc Reports & Analysis L2 - Cleansed & standardized AccessVirtualization Orchestration – Jobs, Plans, Workflows – Active Batch – Enterprise Scheduler – Information Lifecycle Management Data Collection Services Standard Reports & Dashboard s Information Services Engine Real-time Information Services Request / ResponseData Marts Data Marts Customer Hub L4 – Meta Data ODBC / JDBC Change Data Capture (CDC) Event Streaming Encryptio n API Analytic Sandbox QA OLTP Domain Applications Other Applications Application Events Domain Applications Application Events Application Events & Logging OLTP Domain Applications OLTP Domain Applications Other Applications Other Applications Domain Applications Domain Applications
  • 44. 44© Asurion. All rights reserved. Data Lake Security corporate data center Data Virtualization Data Virtualization ldap Presto Data flow Redshift ldap Encryption Engine ELT
  • 45. 45© Asurion. All rights reserved. Data Lake Security • Data virtualization to handle data lake security • ANSI SQL compliant and native pushdown enabled • Enabled column and row level security • Users are authenticated by on-premises existing LDAP • Users are authorized based on the roles defined in data virtualization • Ad hoc queries & reports are through JDBC & ODBC against Amazon Redshift & Presto
  • 46. 46© Asurion. All rights reserved. What have we learned? • Manage cost – adjustments may be required with scale • Real-time frictionless scaling – automate where possible • Align to core design patterns • Leverage readily available solutions • Design for security and compliance at the start • Fail forward and adjust as needed • Harden solution for one market – Integrate regional variance and deploy globally
  • 50. 50© Asurion. All rights reserved. Asurion – Life’s Operating System® Our symbol is inspired by a fingerprint, the intersection of life and technology Support Security & PrivacyMobile Insurance Replace & Repair In-Home Service Extended Warranties
  • 51. 51© Asurion. All rights reserved. Data & Analytics at Asurion – Our Mission Automation & Faster Execution Efficient Data Analytics Platform Smarter Business Decisions Innovative Data Products Enable business decisions with advanced analytics and new actionable insights Develop differentiated capabilities and improved services Make quality data accessible, discover business insights, deliver analytical solutions Acquire, process, and mine structured & unstructured data assets
  • 52. 52© Asurion. All rights reserved. EMR Auto Scaling – Spot Price  Above graph is for Spot Instance pricing history for r3.4xlarge; during certain time periods, it’s greater than $12.00  On-Demand Price for the same instance is $1.33 hr.  Check the Spot price against the On-Demand price before placing the bid during Auto Scaling  Upcoming instance fleets feature will address pulling from multiple Spot Instance types/bid prices
  • 53. 53© Asurion. All rights reserved. How does Asurion Enable Business Secure near real-time accessible data Platform for data discovery – explore & fail fast Deploy models and algorithms Run analytics at scale
  • 54. 54© Asurion. All rights reserved. Store and Ingestion Data with real-time and batch needs, that is required to support reporting, analytics, data science, data mining, predictive modeling, machine learning, data discovery, etc… will be stored in the data lake. Storage Requirements • Ability to handle multiple file types • Scalable storage • Tiered storage • High durability • High availability • Store system of record • Low cost • Secure storage • Encrypted at rest • Rotate encryption periodically • Segregate data by business unit Batch Ingestion Requirements • High throughput • Guaranteed delivery • Parallel execution • High availability • Track dataflow • Track lineage • Encrypt sensitive data • Metadata driven • Support multiple systems • Support incremental processing
  • 55. 55© Asurion. All rights reserved. EMR - Create Hive Table and Recover Partitions CREATE EXTERNAL TABLE example () PARTITIONED BY(dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyT extOutputFormat' LOCATION location 's3n://reinvent/2016/example' TBLPROPERTIES( 'serialization.null.format'='', 'skip.header.line.count'='1‘ ) MSCK REPAIR TABLE example; ALTER TABLE ADD PARTITION (dt='2016-11-17') location 's3n://reinvent/2016/example/dt=2016- 11-17'
  • 56. 56© Asurion. All rights reserved. EMR - External Hive Table on DynamoDB CREATE EXTERNAL TABLE IF NOT EXISTS l0.reinvent2016 ( properties string, MD5 string) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoD BStorageHandler' TBLPROPERTIES ( "dynamodb.table.name" = “dynamo- reinvent2016 "dynamodb.column.mapping" = "properties:SNSMessage,md5:md5" ); SET mapred.job.queue.name=default; set hive.exec.parallel=true; set hive.exec.dynamic.partition.mode=nonstrict; SET hive.exec.max.dynamic.partitions=100000; SET hive.exec.max.dynamic.partitions.pernode=100000; SET dynamodb.throughput.read.percent=.9; Insert Overwrite table l1.reinvent2016 partition(eventtype,eventdate) Select Properties, get_json_object(reinvent2016.properties,'$.event'), md5, <<#Lineage_Key#>>, get_json_object(reinvent2016.properties,'$.propertie s.EventType'), To_date(get_json_object(reinvent2016.properties,'$.p roperties.EventDateTime')) From l0. reinvent2016;
  • 57. 57© Asurion. All rights reserved. EMR - Hive Table – ORC create external table if not exists example () partitioned by (businessdate string) Row format serde org.apache.hadoop.hive.ql.io.orc.OrcSerd e location 's3n://reinvent/2016/example' tblproperties ( 'Orc.Create.Index'='true', 'Orc.Row.Index.Stride'='10000', 'orc.compress'='ZLIB', 'orc.stripe.size'='268435456', ) insert overwrite into table example partition(businessdate) select col_a, col_b, businessdate from l1.table1 group by businessdate ;
  • 58. 58© Asurion. All rights reserved. EMR Auto Scaling Events EMR CloudWatch metrics • Is Cluster Idle • Containers Pending • Apps Pending • YARN Memory Allocated Percentage EMR custom metrics • Ganglia bundled with EMR provides scalable distributed monitoring capability • Custom metrics in CloudWatch for AVG CPU usage from Ganglia { "MaxNode": "25", "MinNode": "1", "EC2AvailabilityZone": "us-east-1c", "NoOfNodeToAdd": "2", "NoOfNodeToRemove": "3", "newIGNodeCount": "2", "InstanceType": "r3.4xlarge", "emrEC2ProductDescriptions": "Linux/UNIX", "maxTaskIG": "30", "BidPriceHikePercentage": ".15", "ScaleUpAlarmName":“scaleup_alarm", "ScaleDownAlarmName":“scaledown_alarm", "IGBidPriceDifThreshold":"0.20", "OnDemandEC2Location": "US East (N. Virginia)", "OnDemandpreInstalledSw": "NA", "OnDemandEC2OS": "Linux", "OnDemandEC2tenancy": "Shared" }