How can you use Big Data to grow your business and discover new opportunities? When organizations effectively capture, analyze, visualize, and apply big data insights to their business goals, they differentiate themselves from their competitors and outperform them in operational efficiency and on the bottom line. With Amazon Web Services, businesses and researchers can easily fulfill their high performance computing (HPC) requirements, with the added benefits of ad-hoc provisioning, pay-as-you-go pricing, and faster time-to-results. Join this session to learn how to run HPC applications in the AWS cloud, and to explore AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (data warehouse), and Amazon Kinesis (streaming): when to use them and how they work together.
Getting Started with Big Data and HPC in the Cloud - August 2015
1. Getting Started with Big Data and HPC in the Cloud
KD Singh
AWS Solutions Architect
kdsingh@amazon.com
2. Big Data
Big Data challenges:
• Capacity planning & scalability
• High cost & commitment of traditional DWHs, IT complexity
• Data variety, volume, and velocity
• Old answers & old questions
What AWS offers instead:
• Managed services: fully managed, secured & automated services that bring agility & focus, with lower cost (OpEx) and room to experiment & learn more
• S3, EMR, Kinesis, DynamoDB: collect all data (sensors/IoT, social, images, videos, enterprise apps, documents, web logs) and run complex computations and processing on it, both in real time and in batch
• Redshift: super-fast, MPP, petabyte-ready analytical data warehouse, available in minutes
• Virtually unlimited & elastic resources: no heavy lifting, reduced time to market, parallel processing on demand
• Big value: new answers, new questions & business ideas; extract the meaning from all your data & focus on new business ideas, models, etc.
6. AWS Big Data Portfolio
• Collect / Ingest: Kinesis, Amazon SQS, Import/Export, Direct Connect
• Store: S3, Glacier, DynamoDB, RDS
• Process / Analyze: EMR, EC2, Redshift, Data Pipeline
• Visualize / Report
7. Industries using AWS for data analysis
• Mobile / Cable / Telecom
• Oil and Gas
• Industrial Manufacturing
• Retail / Consumer
• Entertainment / Hospitality
• Life Sciences
• Scientific Exploration
• Financial Services
• Publishing / Media / Advertising
• Online Media / Social Networks
• Gaming
9. Types of Data Ingest
• Transactional
  – Database reads/writes
• File
  – Media files, log files
• Stream
  – Click-stream logs (sets of events)
Apps, devices, and logging frameworks feed these into a database, cloud storage, or stream storage.
10. Amazon Kinesis: real-time processing of streaming data
• High throughput
• Elastic
• Easy to use
• Connectors for EMR, S3, Redshift, DynamoDB
11. Kinesis Streams: managed ability to capture and store data
• Streams are made of shards
• Each shard ingests up to 1 MB/sec, and up to 1,000 TPS
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or removing shards
• Replay data inside the 24-hour window
• Kinesis Client Library (KCL): Java client library, available on GitHub
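As a quick sizing aid, the per-shard limits above (1 MB/sec and 1,000 TPS in, 2 MB/sec out) can be turned into a back-of-the-envelope calculator. This is an illustrative sketch, not an AWS tool:

```python
import math

def shards_needed(write_mb_per_sec, writes_per_sec, read_mb_per_sec):
    """Estimate the Kinesis shard count for a workload, using the
    per-shard limits from the slide: 1 MB/sec and 1,000 records/sec
    in, 2 MB/sec out. The binding constraint wins."""
    ingest = math.ceil(write_mb_per_sec / 1.0)    # 1 MB/sec in per shard
    records = math.ceil(writes_per_sec / 1000.0)  # 1,000 TPS per shard
    egress = math.ceil(read_mb_per_sec / 2.0)     # 2 MB/sec out per shard
    return max(ingest, records, egress, 1)

# A stream taking 5 MB/sec (3,500 records/sec) with 8 MB/sec of reads:
print(shards_needed(5, 3500, 8))  # → 5 (ingest bandwidth is the constraint)
```

Scaling is then a matter of adding or removing shards as the estimate changes.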
12. Sending and Reading Data from Kinesis Streams
Sending (write):
• HTTP POST
• AWS SDK
• LOG4J
• Flume
• Fluentd
Reading (read):
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
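However a record is sent, Kinesis hashes its partition key with MD5 and routes it to the shard owning that range of the 128-bit hash space. A minimal local sketch of that routing, assuming evenly split shards (not the AWS SDK):

```python
import hashlib

NUM_SHARDS = 4
HASH_SPACE = 2 ** 128  # MD5 output space; shards own contiguous ranges

def shard_for(partition_key, num_shards=NUM_SHARDS):
    """Map a partition key to a shard index, mimicking Kinesis's
    MD5-based routing over evenly split shards (illustrative)."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // HASH_SPACE

for key in ["user-1", "user-2", "user-3"]:
    print(key, "-> shard", shard_for(key))
```

The same key always lands on the same shard, which is what gives per-key ordering within a stream.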
13. AWS Partners for Ingest, Data Load and Transformation
• HParser, Big Data Edition
• Flume, Sqoop
16. Cloud Database and Storage Tier: Use the Right Tool for the Job!
A typical stack has a client tier, an app/web tier, and a data tier. The database & storage tier within the data tier spans search, Hadoop/HDFS, cache, blob store, SQL, and NoSQL.
17. Database & Storage Tier
• Amazon RDS
• Amazon DynamoDB
• Amazon ElastiCache
• Amazon S3
• Amazon Glacier
• Amazon CloudSearch
• HDFS on Amazon EMR
18. What Database and Storage Should I Use?
The choice maps to two axes: data structure complexity and query structure complexity.
• Structured, simple query: NoSQL (Amazon DynamoDB); cache (Amazon ElastiCache: Memcached, Redis)
• Structured, complex query: SQL (Amazon RDS); data warehouse (Amazon Redshift); search (Amazon CloudSearch)
• Unstructured, no query: cloud storage (Amazon S3, Amazon Glacier)
• Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce)
19. What Data Store Should I Use?

Service            | Avg. latency         | Data volume       | Item size        | Request rate             | Storage cost $/GB/mo | Durability
-------------------|----------------------|-------------------|------------------|--------------------------|----------------------|-------------
Amazon ElastiCache | ms                   | GB                | B–KB             | very high                | $$                   | low–moderate
Amazon DynamoDB    | ms                   | GB–TB (no limit)  | KB (64 KB max)   | very high                | ¢¢                   | very high
Amazon RDS         | ms, sec              | GB–TB (3 TB max)  | KB (~row size)   | high                     | ¢¢                   | high
Amazon CloudSearch | ms, sec              | GB–TB             | KB (1 MB max)    | high                     | $                    | high
Amazon EMR (HDFS)  | sec, min, hrs        | GB–PB (~nodes)    | MB–GB            | low–very high            | ¢                    | high
Amazon S3          | ms, sec, min (~size) | GB–PB (no limit)  | KB–GB (5 TB max) | low–very high (no limit) | ¢                    | very high
Amazon Glacier     | hrs                  | GB–PB (no limit)  | GB (40 TB max)   | very low (no limit)      | ¢                    | very high

Reading down the table runs from hot data (ElastiCache, DynamoDB) through warm data to cold data (Glacier).
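The two-axis matrix from slide 18 can be read as a tiny lookup. This helper is purely illustrative (the function and its keys are assumptions, not an AWS API):

```python
# Toy decision helper following the slide's two axes: data structure
# and query complexity. The mapping mirrors the matrix on slide 18.
CHOICES = {
    ("structured", "simple"): ["Amazon DynamoDB", "Amazon ElastiCache"],
    ("structured", "complex"): ["Amazon RDS", "Amazon Redshift", "Amazon CloudSearch"],
    ("unstructured", "none"): ["Amazon S3", "Amazon Glacier"],
    ("unstructured", "custom"): ["Amazon EMR (Hadoop/HDFS)"],
}

def suggest_store(structure, query):
    """Return candidate services for a (structure, query) combination."""
    return CHOICES.get((structure, query), ["(no match; refine requirements)"])

print(suggest_store("structured", "complex"))
```

Latency, volume, item size, and cost from the table above then narrow the candidates to one.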
21. Aggregate All Data in S3, Surrounded by a Collection of the Right Tools
Tools around S3: EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Cassandra, Storm.
S3:
• No limit on the number of objects
• Object size up to 5 TB
• Central data storage for all systems
• High bandwidth
• 99.999999999% durability
• Versioning, lifecycle policies
• Glacier integration
22. Amazon DynamoDB: fully managed NoSQL database service
• Built on solid-state drives (SSDs)
• Consistent low-latency performance
• Any throughput rate
• No storage limits
23. DynamoDB: Managed High Availability and Durability
• Scaling without downtime
• Automatic sharding
• Security inspections, patches, upgrades
• Automatic hardware failover
• Multi-AZ replication
• Hardware configuration designed specifically for DynamoDB
• Performance tuning
28. GPU Processing
CG1 instances:
• Intel Xeon X5570 processors
• 2 x NVIDIA Tesla M2050 GPUs
• CUDA, OpenCL
G2 instances:
• Intel Xeon E5-2670 processors
• 4 x NVIDIA GRID K520 GPUs
• Each GPU with 1,536 CUDA cores and 4 GB of video memory
• DirectX, OpenGL
29. Network Placement Groups
• Cluster instances deployed in a placement group enjoy low-latency, full-bisection 10 Gbps bandwidth
• Connect multiple placement groups to create very large clusters
• Enhanced networking with SR-IOV provides higher packet-per-second (PPS) performance, lower inter-instance latencies, and very low network jitter
30. Why AWS for HPC?
• Low cost with flexible pricing
• Efficient clusters
• Unlimited infrastructure
• Faster time to results
• Concurrent clusters on demand
• Increased collaboration
32. Popular HPC workloads on AWS
• Genome processing
• Modeling and simulation
• Government and educational research
• Monte Carlo simulations
• Transcoding and encoding
• Computational chemistry
33. Across several key industry verticals
• Utilities
• Biopharma
• Materials design
• Manufacturing
• Academic research
• Auto & aerospace
34. TOP500: 76th-fastest supercomputer, on demand
• June 2014 TOP500 list
• 484.2 TFlop/s (LINPACK benchmark)
• 26,496 cores in a cluster of EC2 C3 instances
• Intel Xeon E5-2680 v2 10-core 2.8 GHz processors
35. Three types of data-driven development
• Retrospective: batch analysis and reporting (Amazon Redshift, Amazon RDS, Amazon EMR)
• Here-and-now: stream/real-time processing and dashboards (Amazon Kinesis, Amazon EC2, AWS Lambda)
• Predictions: enabling smart applications (Amazon Machine Learning)
36. Amazon Redshift: columnar data warehouse
• ANSI SQL compatible
• Massively parallel
• Petabyte scale
• Fully managed
• Very cost-effective
37. Amazon Redshift architecture
• Leader node
  – SQL endpoint (JDBC/ODBC)
  – Stores metadata
  – Coordinates query execution
• Compute nodes
  – Local, columnar storage
  – Execute queries in parallel
  – Interconnected over a 10 GigE (HPC) network
  – Ingestion, load, backup, restore via Amazon S3
  – Parallel load from Amazon DynamoDB
• Hardware optimized for data processing
• Two hardware platforms
  – DW1: HDD; scales from 2 TB to 1.6 PB
  – DW2: SSD; scales from 160 GB to 256 TB
38. Amazon Redshift dramatically reduces I/O
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
With row storage you do unnecessary I/O: to get the total amount, you have to read every column of every row.
ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
39. Amazon Redshift dramatically reduces I/O
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
With column storage, you only read the data you need: just the Amount column for the same total.
ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
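The difference can be made concrete with a toy simulation of the table above, counting how many values each layout touches to total the Amount column (illustrative only; real Redshift I/O happens in large blocks):

```python
# Row vs. column storage: how many values are read to sum one column?
rows = [
    {"id": 123, "age": 20, "state": "CA", "amount": 500},
    {"id": 345, "age": 25, "state": "WA", "amount": 250},
    {"id": 678, "age": 40, "state": "FL", "amount": 125},
    {"id": 957, "age": 37, "state": "WA", "amount": 375},
]

# Row storage: every field of every row is fetched to total one column.
row_reads = sum(len(r) for r in rows)        # 16 values touched
total = sum(r["amount"] for r in rows)       # 1250

# Column storage: only the 'amount' column is fetched.
amount_column = [r["amount"] for r in rows]  # laid out contiguously on disk
col_reads = len(amount_column)               # 4 values touched
assert sum(amount_column) == total

print(total, row_reads, col_reads)  # 1250 16 4
```

With four columns, the columnar scan reads a quarter of the data; wide analytic tables make the gap far larger.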
40. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
Columnar compression saves space & reduces I/O; Amazon Redshift analyzes and compresses your data:
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
41. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
Zone maps track the minimum and maximum value for each block, so queries can skip over blocks that don't contain the data they need, minimizing unnecessary I/O.
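A zone map can be sketched in a few lines: keep (min, max) per block and scan only blocks whose range could contain the target (illustrative; block size and data are assumed):

```python
# Zone map sketch: per-block min/max lets a scan skip whole blocks.
BLOCK = 4
data = [12, 15, 11, 14,   51, 57, 50, 55,   92, 90, 95, 99]
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
zone_map = [(min(b), max(b)) for b in blocks]  # maintained at load time

def scan_equal(target):
    """Count matches, reading only blocks whose [min, max] range
    could contain the target. Returns (matches, blocks_scanned)."""
    scanned = matches = 0
    for (lo, hi), block in zip(zone_map, blocks):
        if lo <= target <= hi:
            scanned += 1
            matches += block.count(target)
    return matches, scanned

print(scan_equal(55))  # (1, 1): one match found, only 1 of 3 blocks read
```

The skipping only pays off when values are clustered within blocks, which is why Redshift sort keys matter.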
42. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
Direct-attached storage maximizes throughput on hardware optimized for high-performance data processing, and large block sizes make the most of each read. Amazon Redshift manages durability for you.
43. Amazon Elastic MapReduce: Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and spot pricing
• Tight integration with S3, DynamoDB, and Kinesis
44. How Does EMR Work?
1. Put the data into S3
2. Choose: Hadoop distribution, number of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
45. How Does EMR Work? (continued)
• You can easily resize the cluster
• You can launch parallel clusters using the same S3 data
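The work the cluster performs is classic MapReduce. As a purely local sketch (no EMR APIs; Hadoop distributes these stages across nodes), the canonical word count shows the map, shuffle, and reduce phases:

```python
# Single-process MapReduce word count: the same map -> shuffle -> reduce
# flow an EMR Hadoop cluster runs in parallel over HDFS/S3 data.
from collections import defaultdict

lines = ["big data on aws", "hpc on aws", "big clusters"]

# Map: emit (word, 1) for every word in every input line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values by key (Hadoop does this between nodes).
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce: aggregate each key's values.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["aws"], counts["big"])  # 2 2
```

Hive and Pig compile higher-level queries down to jobs with exactly this shape.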
48. Amazon Machine Learning
• Easy-to-use, managed machine learning service built for developers
• Robust, powerful machine learning technology based on Amazon's internal systems
• Create models using your data already stored in the AWS cloud
• Deploy models to production in seconds
49. Three Supported Types of Predictions
• Binary classification
  – Predict the answer to a yes/no question
• Multi-class classification
  – Predict the correct category from a list
• Regression
  – Predict the value of a numeric variable
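Toy, rule-based stand-ins (plain Python, not the Amazon Machine Learning API; every name and constant here is illustrative) show what each prediction type returns:

```python
# Each function returns the kind of value its prediction type produces.

def binary_is_spam(num_links):
    """Binary classification: a yes/no answer."""
    return num_links > 5  # hypothetical threshold rule

def multiclass_category(word):
    """Multi-class classification: one category from a fixed list."""
    table = {"goal": "sports", "ballot": "politics", "gpu": "tech"}
    return table.get(word, "other")

def regression_price(sqft):
    """Regression: a numeric value."""
    return 50.0 + 0.2 * sqft  # assumed toy linear model

print(binary_is_spam(7), multiclass_category("gpu"), regression_price(1000))
```

A real Amazon ML model learns these mappings from training data instead of hand-written rules, but the output shapes are the same: a boolean, a label, or a number.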