With hundreds of new and sometimes disparate tools, it’s hard to keep pace. Amazon Web Services provides a broad and fully integrated portfolio of cloud computing services to help you build, secure, and deploy your big data applications.
Attend this webinar for an overview of the big data options available in the AWS Cloud – including popular frameworks such as Hadoop and Spark, NoSQL databases, and more. Learn about ideal use cases, anti-patterns, performance, and interfaces. Finally, see how you can build valuable applications with a real-life example.
Learning Objectives:
Learn about big data tools available at AWS
Understand ideal use cases
Learn some of the key considerations, such as performance, scalability, elasticity, and availability, when selecting big data tools
Who Should Attend:
Data Architects, Data Scientists, Developers
2. Table of Contents
• Big Data Introduction for AWS
• Big Data Analytics Options on AWS
• Usage Patterns & Anti-Patterns
• Performance & Cost
• Durability & Scalability
• Interfaces
• Building Big Data Analytic Solutions – The AWS Approach
• Example Scenarios
3. Big Data on AWS
Immediate Availability. Deploy instantly; no hardware to procure, no infrastructure to maintain and scale.
Trusted & Secure. Designed to meet the strictest requirements. Continuously audited, including certifications such as ISO 27001, FedRAMP, DoD CSM, and PCI DSS.
Broad & Deep Capabilities. Over 50 services and hundreds of features to support virtually any big data application and workload.
Hundreds of Partners & Solutions. Get help from a consulting partner or choose from hundreds of tools and applications across the entire data management stack.
6. Amazon Redshift
• Ideal Usage Patterns - Analyze
• Sales data
• Historical data
• Gaming data
• Social trends
• Ad data
• Performance
• Massively Parallel Processing
• Columnar Storage
• Data Compression
• Zone Maps
• Direct-attached Storage
• Cost model
• No upfront costs or long-term commitments
• Free backup storage equivalent to 100% of provisioned storage
With columnar storage, you only read the data you need
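The “read only the data you need” claim can be made concrete with a back-of-the-envelope sketch; the table layout and per-column widths below are hypothetical, chosen purely for illustration:

```python
def bytes_scanned(row_count, column_widths, selected_columns, columnar):
    """Estimate bytes read for a query touching `selected_columns`.
    A row store must read whole rows; a column store reads only the
    selected columns (compression is ignored for simplicity)."""
    if columnar:
        width = sum(column_widths[c] for c in selected_columns)
    else:
        width = sum(column_widths.values())
    return row_count * width

# Hypothetical table: 1M rows, four columns with these widths in bytes.
widths = {"order_id": 8, "customer": 32, "amount": 8, "notes": 200}
full = bytes_scanned(1_000_000, widths, ["amount"], columnar=False)
col = bytes_scanned(1_000_000, widths, ["amount"], columnar=True)
# A SUM(amount) scan reads 8 MB column-wise instead of 248 MB row-wise.
```

Compression and zone maps (skipping blocks whose min/max range excludes the predicate) shrink the columnar number further.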
7. Amazon Redshift
• Scalability & Elasticity
• Resize or scale: the number or type of nodes can be changed with a few clicks
• Durability and Availability
• Replication
• Backup
• Automated recovery from failed drives & nodes
• Interfaces
• JDBC/ODBC interface with BI/ETL tools
• Amazon S3 or DynamoDB
• Anti-patterns
• Small datasets
• OLTP
• Unstructured Data
• Blob Data
(Diagram: Redshift cluster architecture: 10 GigE (HPC) networking; ingestion, backup, and restore via Amazon S3/DynamoDB; client access over JDBC/ODBC.)
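Loading from Amazon S3 over the JDBC/ODBC interface is typically done with a COPY statement, which ingests files in parallel across the cluster’s slices. A minimal sketch follows; the table, bucket, prefix, and IAM role names are hypothetical:

```python
def build_copy_statement(table, bucket, prefix, iam_role):
    """Build a Redshift COPY statement that loads gzipped, pipe-delimited
    files from S3 in parallel across all cluster slices."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "DELIMITER '|' GZIP;"
    )

sql = build_copy_statement("sales", "my-bucket", "sales/2016/",
                           "arn:aws:iam::123456789012:role/RedshiftCopy")

# Executing it requires a live cluster, e.g. via psycopg2:
# import psycopg2
# conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
#                         port=5439, dbname="dev", user="admin", password="...")
# conn.cursor().execute(sql)
```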
9. Amazon Kinesis Streams
• Ideal Usage Patterns – streaming data ingestion and processing
• Real-time data analytics
• Data feed intake and processing, e.g. logs
• Real-time metrics and reporting
• Performance
• Throughput capacity in terms of shards
• Cost model
• No upfront costs or long-term commitments
• Pay-as-you-go pricing
• Hourly charge per shard
• Charge per 1 million PUT transactions
10. Amazon Kinesis Streams
• Scalability & Elasticity
• Scale – increase number of shards
• Durability and Availability
• Replication
• Cursor preservation
• Interfaces
• Input – data coming in
• Output – data going out
• Kinesis Firehose
• Anti-patterns
• Small scale consistent throughput
• Long term data storage and analytics
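Getting data onto a stream comes down to the Kinesis `PutRecord` call, where the partition key determines which shard receives the record. A minimal sketch, with a hypothetical stream name and payload:

```python
import json

def build_put_record(stream_name, payload, partition_key):
    """Build the request for Kinesis `PutRecord`; records with the same
    partition key hash to the same shard, preserving their order."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": partition_key,
    }

req = build_put_record("clickstream", {"page": "/home", "user": 42}, "user-42")

# With AWS credentials configured:
# import boto3
# boto3.client("kinesis").put_record(**req)
```

Because throughput is provisioned per shard, a high-cardinality partition key (like a user ID) spreads load evenly across shards.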
11. Launch a cluster in minutes
Pay by the hour and save with Spot Instances
MapReduce, Apache Spark, Presto
Amazon EMR
12. Amazon EMR
• Ideal Usage Patterns
• Log processing and analytics
• Large ETL and data movement
• Risk modeling and threat analytics
• Ad targeting and click stream analytics
• Genomics
• Predictive analytics
• Ad-hoc data mining and analytics
• Performance – driven by
• Type of instance
• Number of instances
• Cost model
• Only pay for hours the cluster is up
• EC2 instance and EMR price
13. Amazon EMR
• Scalability & Elasticity
• Resize a running cluster
• Add more core or task nodes
• Durability and Availability
• Fault tolerant for slave (core) node failures via HDFS replication
• Backup to S3 for resilience against master node failures
• Interfaces
• Hive, Pig, Spark, HBase, Impala, Hunk, Presto, and other popular tools
• Anti-patterns
• Small data sets
• ACID (Atomicity, Consistency, Isolation, Durability) transaction requirements
Amazon EMR Cluster
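Launching a cluster comes down to one `RunJobFlow` API call. A minimal sketch via boto3 follows; the cluster name, instance types, node count, log bucket, and release label are all illustrative assumptions:

```python
def build_emr_cluster_config(name, master_type, core_type, core_count, log_uri):
    """Build the request for EMR's RunJobFlow API: one master instance
    plus a resizable group of core nodes running Spark and Hive."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.2.0",  # illustrative release
        "LogUri": log_uri,
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": master_type,
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": core_type,
                 "InstanceCount": core_count},
            ],
        },
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
    }

cfg = build_emr_cluster_config("log-processing", "m3.xlarge", "m3.xlarge",
                               4, "s3://my-bucket/emr-logs/")

# With AWS credentials configured:
# import boto3
# boto3.client("emr").run_job_flow(**cfg)
```

Resizing later is the same idea: modify the core or task instance-group count on the running cluster.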
14. Fully managed NoSQL database
Single-digit millisecond latency at scale
Supports document and key-value models
Amazon DynamoDB
15. Amazon DynamoDB
• Ideal Usage Patterns
• Mobile apps, gaming, digital ad serving, live
voting, sensor networks, log ingestion
• Access control for web-based content, e-commerce shopping carts
• Web session management
• Performance
• SSD
• Provision throughput by table
• Scalability & Elasticity
• No limit to the amount of data stored
• Dial-up or dial-down the read and write capacity
of a table
• Cost Model
• Pay as you go
• Provisioned throughput capacity (per hour)
• Indexed data storage (per GB per month)
• Data transfer in or out (per GB per month)
Provisioned read/write performance per table.
Predictable high performance scaled via console or API
16. Amazon DynamoDB
• Durability and Availability
• Three Availability Zones (AZ)
• Interfaces
• AWS Management Console
• APIs
• SDKs
• Anti-patterns
• Application tied to traditional relational
database
• Joins and/or complex transactions
• BLOB data
• Large data with low I/O rate
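A web-session write, one of the ideal usage patterns above, can be sketched with the low-level `PutItem` API; the table name, key, and attribute names are hypothetical:

```python
def build_put_item(table_name, session_id, user_id, ttl_epoch):
    """Build a DynamoDB `PutItem` request for a web-session record,
    using the low-level attribute-value format (S = string, N = number)."""
    return {
        "TableName": table_name,
        "Item": {
            "session_id": {"S": session_id},   # partition key
            "user_id": {"S": user_id},
            "expires_at": {"N": str(ttl_epoch)},
        },
    }

req = build_put_item("sessions", "sess-123", "user-42", 1700000000)

# With AWS credentials configured:
# import boto3
# boto3.client("dynamodb").put_item(**req)
```

Reads and writes against the table consume the provisioned capacity you dial up or down per table.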
17. Managed service designed to make it easy for developers of all levels to use machine learning
Based on the same ML technology used for years by Amazon’s internal data scientists
Amazon Machine Learning uses scalable and robust implementations of industry-standard ML algorithms
Amazon Machine Learning
18. Amazon Machine Learning
• Ideal Usage Patterns
• Enable applications that flag suspicious
transactions
• Personalize application content
• Predict user activity
• Listen to social media
• Cost Model
• Pay for what you use
• No need to manage instances, only pay for
the service
• Performance
• Real-time predictions designed to return within 100 ms
• 200 transactions per second by default (limit can be raised)
19. Amazon Machine Learning
• Durability and Availability
• No maintenance windows or scheduled
downtimes
• Designed across multiple availability
zones
• Scalability & Elasticity
• Model training up to 100GB
• Multiple ML jobs can run simultaneously
• Interfaces
• Create a data source from S3, RDS and
Redshift
• Interact with ML via console, SDKs, and
the ML API
• Anti-patterns
• Massive datasets for modeling (> 100 GB)
• Sequence prediction or unsupervised clustering tasks
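A real-time prediction can be sketched against the Amazon ML `Predict` API, which takes a model ID, the model’s real-time endpoint, and a record of string feature values. The model ID, endpoint URL, and feature names below are hypothetical:

```python
def build_predict_request(model_id, endpoint_url, record):
    """Build the parameters for the Amazon ML real-time `Predict` call;
    `record` maps feature names to values, which the API expects as strings."""
    return {
        "MLModelId": model_id,
        "PredictEndpoint": endpoint_url,
        "Record": {k: str(v) for k, v in record.items()},
    }

req = build_predict_request(
    "ml-XXXXXXXXXXXX",  # hypothetical model id
    "https://realtime.machinelearning.us-east-1.amazonaws.com",
    {"amount": 249.99, "country": "DE"},
)

# With AWS credentials and a deployed real-time endpoint:
# import boto3
# boto3.client("machinelearning").predict(**req)
```

The response carries the predicted label or value, e.g. whether the transaction looks suspicious.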
20. Event driven, fully managed
compute
No Infrastructure to Manage
Automatic Scaling
AWS Lambda
21. AWS Lambda
• Ideal Usage Patterns
• Real-time file processing
• Extract, Transform, Load
• Performance
• Process events within
milliseconds
• Cost Model
• Pay for what you use
• No need to manage instances,
only pay for the service
• Lambda free tier includes 1M free
requests
1. Serverless  2. Event-driven scale  3. Subsecond billing
22. AWS Lambda
• Durability and Availability
• No maintenance windows or
scheduled downtime
• Async functions are retried 3 times if
there is a failure
• Scalability & Elasticity
• Any number of concurrent functions can be run
• AWS Lambda will dynamically allocate
capacity to match the rate of incoming
events.
• Interfaces
• Lambda supports Java, Node.js, and
Python
• Trigger via event or schedule
• Anti-patterns
• Long running applications
• Stateful applications in Lambda
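A minimal real-time file-processing handler, the first ideal usage pattern above, might look like the sketch below. The event shape follows S3’s notification format; the bucket and key names are made up, and a real function would transform the object rather than just report it:

```python
def handler(event, context):
    """Minimal Lambda handler for an S3 put event: extract the bucket
    and key from each record (a real handler would fetch and process
    the object here)."""
    results = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        results.append({"bucket": s3["bucket"]["name"],
                        "key": s3["object"]["key"]})
    return results

# Invoking locally with a fake S3 notification event:
event = {"Records": [{"s3": {"bucket": {"name": "my-bucket"},
                             "object": {"key": "logs/2016/01/28/app.log"}}}]}
result = handler(event, None)
```

Because Lambda scales by running more concurrent copies of the handler, the function must stay stateless, which is exactly why stateful applications appear in the anti-patterns.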
23. Set up an Elasticsearch cluster in minutes
Integrated with Logstash and Kibana
Scale an Elasticsearch cluster seamlessly
Amazon Elasticsearch Service
24. Amazon Elasticsearch
• Ideal Usage Patterns
• Analyze logs
• Analyze data stream updates from other AWS
services
• Provide customers a rich search and navigation
experience
• Usage monitoring for mobile applications
• Performance
• Depends on multiple factors, including instance type, workload, index, number of shards used, and read replicas
• Storage configurations: instance storage or EBS storage
• Cost Model
• Pay as you go
• Only pay for compute and storage
25. Amazon Elasticsearch
• Durability and Availability
• Zone Awareness
• Automatic and Manual snapshots
• Scalability & Elasticity
• Add or remove instances
• Modify EBS volumes for data growth
• Interfaces
• AWS Management Console
• API’s
• SDK’s
• Kibana and Logstash (ELK Stack)
• Anti-patterns
• OLTP
• Workloads requiring more than 5 TB of storage
Elasticsearch + Logstash + Kibana = real-time analytics & visualization
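A log-analytics query against the cluster can be sketched as an Elasticsearch search body using the standard query DSL; the index name, field names, and domain endpoint below are hypothetical:

```python
def build_log_query(level, minutes):
    """Build an Elasticsearch search body: match a log level and
    filter to entries from the last `minutes` minutes."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"level": level}}],
                "filter": [{"range": {"@timestamp": {"gte": f"now-{minutes}m"}}}],
            }
        },
        "size": 50,
    }

body = build_log_query("ERROR", 15)

# Against the domain's HTTPS endpoint (request signing/auth omitted):
# import requests
# requests.get("https://search-mydomain.us-east-1.es.amazonaws.com"
#              "/logs/_search", json=body)
```

Kibana issues the same kind of queries under the hood when rendering dashboards over the log indices.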
26. Build visualizations
Perform ad-hoc analysis
Share and collaborate via storyboards
Native access on major mobile platforms
Amazon QuickSight
27. Introducing Amazon QuickSight
Cloud-Powered Business Intelligence Service for 1/10th the Cost of Traditional BI Software
No IT effort. No dimensional modeling
Auto-discovery of all AWS data sources
Super-fast, Parallel, In-memory Calculation Engine (SPICE)
Fully Managed
Available in Preview
aws.amazon.com/quicksight
28. Scale up or down as needed
Pay for what you use
Multiple options
Do-it-yourself big data applications
Amazon EC2
29. The AWS Approach
• Flexible: use the best tool for the job, driven by data structure, latency, throughput, and access patterns
• Scalable: immutable (append-only) data with batch/speed/serving layers
• Minimal admin overhead: leverage AWS managed services (no or very low admin)
• Low cost: big data ≠ big cost
30. Scenario 1: Enterprise Data Warehouse
Scenario 2: Capturing and Analyzing Sensor Data
Scenario 3: Sentiment Analysis of Social Media
Big Data
Scenarios
31. Scenario 1: Enterprise Data Warehouse
Data Warehouse Architecture:
Data Sources → Amazon S3 → Amazon EMR → Amazon S3 → Amazon Redshift → Amazon QuickSight
33. Scenario 3: Sentiment Analysis of Social Media
(Diagram: sentiment-analysis pipeline combining social media data with Amazon EC2, Amazon Kinesis, AWS Lambda, Amazon ML, Amazon S3, and Amazon SNS.)
34. Next Steps
• Subscribe to the AWS Big Data Blog
blogs.aws.amazon.com/bigdata
• Learn more, check the tutorials, guides, and self-paced labs
aws.amazon.com/big-data
• Register for the next Big Data Webinar
Building Smart Applications with Amazon Machine Learning
aws.amazon.com/about-aws/events/monthlywebinarseries
Thu, Jan 28 2016 | 9AM PST
Editor's Notes
Follow Up Email
Amazon
https://www.youtube.com/watch?v=P4KPPvEb_QI
Generates weblogs @ 2TB/day, growing 67% YoY
Oracle RAC legacy system
Scan rate: 1 week of data/hour
Hit RAC node limit of 32 nodes
More data => Slower queries
Migrated to Redshift
Scan rate: 15 months of data (2.25 trillion rows) in 14 min
Scaled to a 101 node DS1.8XL cluster – Petabytes
More than 10X performance
21B rows joined with 10B rows in under 2 hours, down from days
Security; HasOffers loads 60M rows per day in 2-minute intervals; Desk runs a high-concurrency user-facing portal (read/write cluster); Amazon.com/NTT operate at PB scale. Pinterest saw 50-100x speed-ups when it moved 300TB from Hadoop to Redshift. Nokia saw a 50% reduction in costs.
https://www.youtube.com/watch?v=O4wAH5FQjS8
30 Million Ad opportunities per month.
Yelp uses Amazon S3 to store daily logs and photos, generating around 1.2TB of logs per day. The company also uses Amazon EMR to power approximately 20 separate batch scripts, most of them processing the logs. Several Yelp features are powered by Amazon Elastic MapReduce.
Yelp developers advise others working with AWS to use the boto API as well as mrjob to ensure full utilization of Amazon Elastic MapReduce job flows. Yelp runs approximately 250 Amazon Elastic MapReduce jobs per day, processing 30TB of data and is grateful for AWS Support that helped with their Hadoop application development.
Dropcam - Dropcam runs video streaming and storage servers on Amazon EC2 and Amazon S3, and uses Amazon DynamoDB to scale and maintain throughput. “DynamoDB grows with the number of cameras that are connected to the service,” says Nelson. “Throughput is very steady as cameras come online. By using DynamoDB, we reduced delivery time for video events to less than 50 milliseconds,” says Nelson.
Build Fax - Uses Amazon Machine Learning to provide roof-age and job-cost estimations for insurers and builders, with property-specific values that don’t need to rely on broad, ZIP code-level estimates. Models that previously took six months or longer to create are now complete in four weeks or less. Creates opportunities for new data analytics services that BuildFax can offer to customers, such as text analysis in Amazon ML to estimate job costs with 80 percent accuracy.
VidRoll - AWS Lambda enables NoOps, allowing us to start and stay at scale without having to worry about infrastructure. As an exponential organization, it is critical that our developers focus on innovation. Lambda frees us from ever having to code for issues like concurrency, distributed file systems and other ‘success problems’ that typically present themselves when systems need to scale. We save time and money with Lambda.
Amazon Elasticsearch service allows you to easily and securely deploy and scale an ELK stack in minutes. Integration with Logstash is tightly coupled and a Kibana instance is automatically configured for you. The service automatically detects and replaces failed Elasticsearch nodes, reducing the overhead associated with self-managed infrastructure and Elasticsearch software.
https://aws.amazon.com/solutions/case-studies/major-league-baseball-mlbam/
Major League Baseball Advanced Media, L.P., which operates MLB.com, uses Elasticsearch extensively on its advanced game day statistics application. “Elasticsearch allows us to easily and quickly build bleeding edge big data and analytics applications using the ELK stack,” said Sean Curtis, Architect at MLB.com. “By offering direct access to the Elasticsearch API while offloading administrative tasks, Amazon Elasticsearch Service gives us the manageability, flexibility and control we need.”
Before we go into the big data architecture, I want to introduce some tried-and-tested architecture principles.
Here at AWS we believe you should use the right tool for the job: instead of using a big Swiss Army knife to drive a screw, it is best to use a screwdriver. This is especially important for big data architectures. We’ll talk about this more.
Decoupled architecture http://whatis.techtarget.com/definition/decoupled-architecture - In general, a decoupled architecture is a framework for complex work that allows components to remain completely autonomous and unaware of each other. This approach has been tried and battle-tested.
Managed services – this is relatively new. Should I install Cassandra, MongoDB, or CouchDB on AWS? You obviously can, and sometimes there are good reasons for doing this; many customers still do. Netflix is a great example: they run a multi-region Cassandra deployment and are a poster child for how to do this. But for most customers, delegating this task to AWS makes more sense. You are better off spending your time building features for your customers rather than building highly scalable distributed systems.
Lambda Architecture -