Evolving your analytics from batch processing to real-time processing can have a major impact on your business. However, designing and maintaining applications that continuously collect and analyze large volumes of data can be extremely difficult without the right tools. With Amazon Kinesis you can easily ingest streaming data without building complex pipelines. In this webinar, we show you how to build your first real-time processing application on AWS, using a combination of managed services like Amazon Kinesis and open-source tools such as Apache Spark and Zeppelin. First, you will learn how to push randomly generated sensor event data into Kinesis. Once the data is available in Kinesis, we consume and process it using Spark Streaming. Finally, we build visualizations of this data with Zeppelin, an open-source tool that provides interactive notebooks for data exploration with Spark and can be installed on Amazon EMR with a single click.
Learning Objectives:
• Understand the basics of ingesting streaming data with Amazon Kinesis Streams
• Create a real-time ingestion application
• Work with Apache Zeppelin and Spark on Amazon EMR to process Kinesis streams
Who Should Attend:
• Developers, Data Analysts, Data Engineers, Data Scientists and Architects
2. Agenda
1. Introduction to Amazon EMR and Kinesis
2. Real-Time Processing Application Demo
3. Best Practices
3. Why Amazon EMR?
• Easy to use – launch a cluster in minutes
• Low cost – pay an hourly rate
• Elastic – easily add or remove capacity
• Reliable – spend less time monitoring
• Secure – manage firewalls
• Flexible – customize the cluster
5. Options to submit jobs to EMR
• Amazon EMR Step API – submit a Spark application
• AWS Data Pipeline, Airflow, Luigi, or other schedulers on EC2 – create a pipeline to schedule job submission or create complex workflows
• AWS Lambda – use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
6. Many storage layers to choose from
• Amazon S3 (via EMRFS, the EMR File System)
• Amazon DynamoDB
• Amazon RDS
• Amazon Kinesis
• Amazon Redshift
11. Spark 2.0 – Performance Enhancements
• Second-generation Tungsten engine
• Whole-stage code generation to create optimized bytecode at runtime
• Improvements to the Catalyst optimizer for query performance
• New vectorized Parquet decoder to increase throughput
12. Datasets & DataFrames (Spark 2.0)
Datasets
• Distributed collection of data
• Strong typing, can use Lambda
functions
• Object-oriented operations (similar to
RDD API)
• Optimized encoders for better
performance; minimize serialization
/deserialization overhead
• Compile-time type safety for robust
applications
DataFrames
• Dataset organized into named
columns
• Represented as a Dataset of rows
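To make the distinction concrete, here is a minimal Scala sketch; the Reading case class and its values are illustrative, not part of the webinar demo:

// Minimal sketch: Datasets vs DataFrames in Spark 2.0
import org.apache.spark.sql.SparkSession

case class Reading(deviceId: String, temperature: Double)

val spark = SparkSession.builder().appName("datasets-vs-dataframes").getOrCreate()
import spark.implicits._

// Dataset[Reading]: strongly typed, compile-time checked, accepts lambdas
val ds = Seq(Reading("d1", 21.5), Reading("d2", 35.0)).toDS()
val hot = ds.filter(r => r.temperature > 30.0)

// DataFrame = Dataset[Row]: the same data organized into named columns
val df = ds.toDF()
df.select("deviceId", "temperature").show()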
13. Spark SQL (Spark 2.0)
• SparkSession – replaces the old SQLContext and HiveContext
• Seamlessly mix SQL with Spark programs (see the sketch below)
• ANSI SQL parser and subquery support
• HiveQL compatibility; can directly use tables in the Hive metastore
• Connect through JDBC/ODBC using the Spark Thrift Server
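A short Scala sketch of what mixing SQL and the DataFrame API looks like in practice (the readings table and its values are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-sql-demo").getOrCreate()
import spark.implicits._

val readings = Seq(("d1", 21.5), ("d2", 35.0)).toDF("device_id", "temperature")
readings.createOrReplaceTempView("readings")

// ANSI SQL with a subquery...
val hot = spark.sql(
  "SELECT device_id FROM readings " +
  "WHERE temperature > (SELECT avg(temperature) FROM readings)")

// ...mixed freely with DataFrame operations on the result
hot.orderBy($"device_id").show()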
14. Spark 2.0 – Structured Streaming
• Structured Streaming API: an extension of the DataFrame/Dataset API (instead of DStream)
• SparkSession is the new entry point for streaming
• Unifies processing of static and streaming datasets, abstracting away the velocity of the data (see the sketch below)
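As a rough sketch of the programming model (the S3 paths and schema handling are placeholders, not part of the demo):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-streaming-demo").getOrCreate()

// A streaming DataFrame reads like a static one, but is unbounded
val schema = spark.read.json("s3://my-bucket/sample/").schema // schema from a static sample
val events = spark.readStream.format("json").schema(schema).load("s3://my-bucket/incoming/")

// Identical DataFrame operations apply to static and streaming data
val counts = events.groupBy("device_id").count()

counts.writeStream
  .outputMode("complete") // re-emit the full aggregate table on each trigger
  .format("console")
  .start()
  .awaitTermination()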
15. Configuring Executors – Dynamic Allocation
• Optimal resource utilization
• YARN dynamically adjusts executors based on the resource needs of the Spark application
• Spark uses the executor memory and executor cores settings in the configuration for each executor (illustrated below)
• Amazon EMR uses dynamic allocation by default and calculates the default executor size from the instance family of your core group
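For illustration, here are the settings involved, shown through the SparkSession builder; on EMR these normally live in spark-defaults.conf, and the values below are invented:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")
  .config("spark.dynamicAllocation.enabled", "true") // EMR's default
  .config("spark.executor.memory", "4g")             // memory per executor
  .config("spark.executor.cores", "2")               // cores per executor
  .getOrCreate()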
16. Try different configurations to find your optimal architecture
Choose your instance types by workload:
• CPU – c3 family, c4 family, cc1.4xlarge, cc2.8xlarge (batch processing)
• Memory – m2 family, r3 family (machine learning)
• Disk/IO – d2 family, i2 family (Spark and interactive)
• General – m4 family, m3 family, m1 family (large HDFS)
17. Zeppelin 0.6.1 – New Features
• Shiro Authentication
• Notebook Authorization
Configuration to save notebooks in S3 (zeppelin-env.sh):
export ZEPPELIN_NOTEBOOK_S3_BUCKET=bucket_name
export ZEPPELIN_NOTEBOOK_S3_USER=username
export ZEPPELIN_NOTEBOOK_S3_KMS_KEY_ID=kms-key-id   # optional
18. Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS:
• Amazon Kinesis Streams – build your own custom applications that process or analyze streaming data
• Amazon Kinesis Firehose – easily load massive volumes of streaming data into Amazon S3 and Amazon Redshift
• Amazon Kinesis Analytics – easily analyze data streams using standard SQL queries
19. Amazon Kinesis Concepts
• Streams: an ordered sequence of data records
• Data records: the unit of data stored in a Kinesis stream
• Producers: anything that puts records into a Kinesis stream, e.g. via the Kinesis Producer Library (KPL)
• Consumers: endpoints that get records from a Kinesis stream and process them, e.g. via the Kinesis Client Library (KCL)
20. Amazon Kinesis Streams features…
• Kinesis Client Library in Python, Node.js, Ruby, and more
• PutRecords API: up to 500 records or 5 MB of payload per request (sketched below)
• Kinesis Producer Library to simplify producer development
• Server-side timestamps
• Individual record payloads up to 1 MB
• Low end-to-end propagation delay
• Stream retention defaults to 24 hours and can be increased to 7 days
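A hedged Scala sketch of a PutRecords call through the AWS SDK for Java; the stream name matches the demo, while the payloads and batch size are illustrative:

import java.nio.ByteBuffer
import java.util.UUID
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.{PutRecordsRequest, PutRecordsRequestEntry}

val kinesis = AmazonKinesisClientBuilder.defaultClient()

// One call may carry up to 500 records or 5 MB of payload
val entries = (1 to 100).map { i =>
  new PutRecordsRequestEntry()
    .withPartitionKey(UUID.randomUUID().toString) // randomized key spreads puts across shards
    .withData(ByteBuffer.wrap(s"device-$i,21.5,${System.currentTimeMillis}".getBytes("UTF-8")))
}.asJava

val result = kinesis.putRecords(
  new PutRecordsRequest().withStreamName("spark-demo").withRecords(entries))

// Records can fail individually; check the count and retry the failures
println(s"Failed records: ${result.getFailedRecordCount}")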
22. Real-Time Processing Application
Produce → Collect → Process → Analyze:
• Produce: Kinesis Producer (KPL)
• Collect: Amazon Kinesis
• Process and Analyze: Amazon EMR + Apache Zeppelin
23. Real-Time Processing Application – 5 Steps
http://bit.ly/realtime-aws
1. Create Kinesis stream
2. Create Amazon EMR cluster
3. Start Kinesis producer
4. Ad-hoc analytics – analyze data using Apache Zeppelin on Amazon EMR with Spark Streaming
5. Continuous streaming – Spark Streaming application submitted to the Amazon EMR cluster using the Step API
Architecture: Amazon EC2 (Kinesis producer) → Amazon Kinesis → Amazon EMR (Kinesis consumer)
24. Real-Time Processing Application
1. Create Kinesis stream:
$ aws kinesis create-stream --stream-name spark-demo --shard-count 2
25. Real-Time Processing Application
2. Create Amazon EMR cluster with Spark and Zeppelin:
$ aws emr create-cluster --release-label emr-5.0.0 \
    --applications Name=Zeppelin Name=Spark Name=Hadoop \
    --enable-debugging \
    --ec2-attributes KeyName=test-key-1,AvailabilityZone=us-east-1d \
    --log-uri s3://kinesis-spark-streaming1/logs \
    --instance-groups \
      Name=Master,InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
      Name=Core,InstanceGroupType=CORE,InstanceType=m3.xlarge,InstanceCount=2 \
      Name=Task,InstanceGroupType=TASK,InstanceType=m3.xlarge,InstanceCount=2 \
    --name "kinesis-processor"
26. Real-Time Processing Application
3. Start Kinesis producer (using the Kinesis Producer Library):
a. On a host (e.g. EC2), download the JAR and run the Kinesis producer:
$ wget https://s3.amazonaws.com/chayel-emr/KinesisProducer.jar
$ java -jar KinesisProducer.jar
b. A continuous stream of data records is fed into Kinesis in CSV format (a rough sketch of such a producer follows):
device_id,temperature,timestamp
Git link: https://github.com/manjeetchayel/emr-kpl-demo
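The JAR's source is in the repository above; as a rough sketch (not the actual KinesisProducer.jar code), a KPL producer of such CSV records looks like this in Scala:

import java.nio.ByteBuffer
import scala.util.Random
import com.amazonaws.services.kinesis.producer.{KinesisProducer, KinesisProducerConfiguration}

val config = new KinesisProducerConfiguration()
  .setRegion("us-east-1")
  .setRecordMaxBufferedTime(1000) // ms; the KPL buffers and aggregates records

val producer = new KinesisProducer(config)

while (true) {
  // device_id,temperature,timestamp
  val csv = s"device-${Random.nextInt(10)},${20 + Random.nextInt(20)},${System.currentTimeMillis}"
  // Random partition key to distribute records across the stream's shards
  producer.addUserRecord("spark-demo", Random.nextInt(1000).toString,
    ByteBuffer.wrap(csv.getBytes("UTF-8")))
  Thread.sleep(100)
}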
27. Real-Time Processing Application
4. Use Zeppelin on Amazon EMR:
• Configure the Spark interpreter to use the org.apache.spark:spark-streaming-kinesis-asl_2.11:2.0.0 dependency
• Import the notebook from https://raw.githubusercontent.com/manjeetchayel/aws-big-data-blog/master/aws-blog-realtime-analytics-using-zeppelin/Spark_Streaming.json
• Run the Spark code blocks to generate real-time analytics (see the sketch below)
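A condensed sketch of the kind of Scala code such a notebook runs; the application name "zeppelin-demo" and the aggregation are illustrative, and the real notebook is at the URL above:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(sc, Seconds(10)) // sc is provided by Zeppelin

// The KCL checkpoints shard positions in a DynamoDB table named after the application
val lines = KinesisUtils.createStream(
    ssc, "zeppelin-demo", "spark-demo",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)
  .map(bytes => new String(bytes, "UTF-8")) // device_id,temperature,timestamp

// Average temperature per device for each 10-second batch
lines.map(_.split(","))
  .map(fields => (fields(0), (fields(1).toDouble, 1)))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, n) => sum / n }
  .print()

ssc.start()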
29. Best Practices – Kinesis
• Use the Kinesis Producer Library
• Increase capacity utilization with aggregation
• Set up CloudWatch alarms
• Use randomized partition keys to distribute puts across shards
• Deal with unsuccessful records
• Use PutRecords for higher throughput
• Use the Kinesis Scaling Utility to scale streams
30. Best Practices – Amazon EMR + Spark Streaming
• Use the Step API to submit long-running applications (sketched below)
• Run Spark applications in cluster mode
• Use maximizeResourceAllocation
• Use a unique Spark application name for the KCL DynamoDB metadata table
• Set up CloudWatch alarms
• Use IAM roles
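For the first point, a hypothetical Scala sketch of submitting a Spark application through the Step API with the AWS SDK for Java; the cluster ID, class name, and JAR path are placeholders:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
import com.amazonaws.services.elasticmapreduce.model.{ActionOnFailure, AddJobFlowStepsRequest, HadoopJarStepConfig, StepConfig}

val emr = AmazonElasticMapReduceClientBuilder.defaultClient()

// command-runner.jar lets an EMR step invoke spark-submit on the cluster
val step = new StepConfig()
  .withName("kinesis-streaming-app")
  .withActionOnFailure(ActionOnFailure.CONTINUE)
  .withHadoopJarStep(new HadoopJarStepConfig()
    .withJar("command-runner.jar")
    .withArgs("spark-submit", "--deploy-mode", "cluster", // cluster mode, per the best practice
      "--class", "com.example.StreamingApp",
      "s3://my-bucket/jars/streaming-app.jar"))

emr.addJobFlowSteps(new AddJobFlowStepsRequest()
  .withJobFlowId("j-XXXXXXXXXXXX")
  .withSteps(step))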