How can you use Big Data to grow your business and discover new opportunities? When organizations effectively capture, analyze, visualize, and apply big data insights to their business goals, they differentiate themselves from their competitors and outperform them in operational efficiency and on the bottom line. With Amazon Web Services, businesses and researchers can easily fulfill their high performance computing (HPC) requirements, with the added benefits of ad-hoc provisioning, pay-as-you-go pricing, and faster time-to-results. Join this session to learn how to run HPC applications in the AWS cloud, and to explore AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (data warehouse), and Amazon Kinesis (streaming): when to use them and how they work together.
Getting Started with Big Data and HPC in the Cloud - August 2015
1. Getting Started with Big Data and HPC in the Cloud
KD Singh
AWS Solutions Architect
kdsingh@amazon.com
2. Big Data
Big Data challenges:
• Capacity planning & scalability
• High cost & commitment of traditional DWHs, IT complexity
• Data variety, volume, and velocity
• Old answers & old questions
What AWS offers instead:
• Managed services: fully managed, secured & automated services that bring agility & focus, with lower cost (OpEx) and room to experiment & learn more
• S3, EMR, Kinesis, DynamoDB: collect all data (sensors/IoT, social, images, videos, enterprise apps, documents, web logs) and run complex computations and processing on it, both in real time and in batch
• Redshift: super-fast, MPP, petabyte-ready analytical data warehouse, available in minutes
• Virtually unlimited & elastic resources: no heavy lifting, reduced time to market, parallel processing on demand
• Big value: new answers, new questions & business ideas; extract the meaning from all your data & focus on new business ideas, models, etc.
6. AWS Big Data Portfolio
• Collect / Ingest: Kinesis, Amazon SQS, Import/Export, Direct Connect
• Store: S3, Glacier, DynamoDB, RDS
• Process / Analyze: EMR, EC2, Redshift, Data Pipeline
• Visualize / Report
7. Industries using AWS for data analysis
• Mobile / Cable / Telecom
• Oil and Gas
• Industrial Manufacturing
• Retail / Consumer
• Entertainment / Hospitality
• Life Sciences
• Scientific Exploration
• Financial Services
• Publishing / Media / Advertising
• Online Media / Social Networks
• Gaming
9. Types of Data Ingest
• Transactional
  – Database reads/writes
• File
  – Media files, log files
• Stream
  – Click-stream logs (sets of events)
Apps, devices, and logging frameworks feed these into a database, cloud storage, or stream storage.
10. Amazon Kinesis: real-time processing of streaming data
• High throughput
• Elastic
• Easy to use
• Connectors for EMR, S3, Redshift, DynamoDB
11. Kinesis Streams: managed ability to capture and store data
• Streams are made of shards
• Each shard ingests up to 1 MB/sec, and up to 1,000 TPS
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or removing shards
• Replay data inside the 24-hour window
• Kinesis Client Library (KCL): Java client library, available on GitHub
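As a quick sizing aid, the per-shard limits above (1 MB/sec and 1,000 TPS in, 2 MB/sec out) can be turned into a back-of-the-envelope calculator. This is an illustrative sketch, not an AWS tool:

```python
import math

def shards_needed(write_mb_per_sec, writes_per_sec, read_mb_per_sec):
    """Estimate the Kinesis shard count for a workload, using the
    per-shard limits from the slide: 1 MB/sec and 1,000 records/sec
    in, 2 MB/sec out. The binding constraint wins."""
    ingest = math.ceil(write_mb_per_sec / 1.0)    # 1 MB/sec in per shard
    records = math.ceil(writes_per_sec / 1000.0)  # 1,000 TPS per shard
    egress = math.ceil(read_mb_per_sec / 2.0)     # 2 MB/sec out per shard
    return max(ingest, records, egress, 1)

# A stream taking 5 MB/sec (3,500 records/sec) with 8 MB/sec of reads:
print(shards_needed(5, 3500, 8))  # → 5 (ingest bandwidth is the constraint)
```

Scaling is then a matter of adding or removing shards as the estimate changes.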
12. Sending and Reading Data from Kinesis Streams
Sending (write):
• HTTP POST
• AWS SDK
• LOG4J
• Flume
• Fluentd
Reading (read):
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
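However a record is sent, Kinesis hashes its partition key with MD5 and routes it to the shard owning that range of the 128-bit hash space. A minimal local sketch of that routing, assuming evenly split shards (not the AWS SDK):

```python
import hashlib

NUM_SHARDS = 4
HASH_SPACE = 2 ** 128  # MD5 output space; shards own contiguous ranges

def shard_for(partition_key, num_shards=NUM_SHARDS):
    """Map a partition key to a shard index, mimicking Kinesis's
    MD5-based routing over evenly split shards (illustrative)."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // HASH_SPACE

for key in ["user-1", "user-2", "user-3"]:
    print(key, "-> shard", shard_for(key))
```

The same key always lands on the same shard, which is what gives per-key ordering within a stream.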
13. AWS Partners for Ingest, Data Load and Transformation
• HParser, Big Data Edition
• Flume, Sqoop
16. Cloud Database and Storage Tier: Use the Right Tool for the Job!
A typical stack has a client tier, an app/web tier, and a data tier. The database & storage tier within the data tier spans search, Hadoop/HDFS, cache, blob store, SQL, and NoSQL.
17. Database & Storage Tier
• Amazon RDS
• Amazon DynamoDB
• Amazon ElastiCache
• Amazon S3
• Amazon Glacier
• Amazon CloudSearch
• HDFS on Amazon EMR
18. What Database and Storage Should I Use?
The choice maps to two axes: data structure complexity and query structure complexity.
• Structured, simple query: NoSQL (Amazon DynamoDB); cache (Amazon ElastiCache: Memcached, Redis)
• Structured, complex query: SQL (Amazon RDS); data warehouse (Amazon Redshift); search (Amazon CloudSearch)
• Unstructured, no query: cloud storage (Amazon S3, Amazon Glacier)
• Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce)
19. What Data Store Should I Use?

Service            | Avg. latency         | Data volume       | Item size        | Request rate             | Storage cost $/GB/mo | Durability
-------------------|----------------------|-------------------|------------------|--------------------------|----------------------|-------------
Amazon ElastiCache | ms                   | GB                | B–KB             | very high                | $$                   | low–moderate
Amazon DynamoDB    | ms                   | GB–TB (no limit)  | KB (64 KB max)   | very high                | ¢¢                   | very high
Amazon RDS         | ms, sec              | GB–TB (3 TB max)  | KB (~row size)   | high                     | ¢¢                   | high
Amazon CloudSearch | ms, sec              | GB–TB             | KB (1 MB max)    | high                     | $                    | high
Amazon EMR (HDFS)  | sec, min, hrs        | GB–PB (~nodes)    | MB–GB            | low–very high            | ¢                    | high
Amazon S3          | ms, sec, min (~size) | GB–PB (no limit)  | KB–GB (5 TB max) | low–very high (no limit) | ¢                    | very high
Amazon Glacier     | hrs                  | GB–PB (no limit)  | GB (40 TB max)   | very low (no limit)      | ¢                    | very high

Reading down the table runs from hot data (ElastiCache, DynamoDB) through warm data to cold data (Glacier).
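The two-axis matrix from slide 18 can be read as a tiny lookup. This helper is purely illustrative (the function and its keys are assumptions, not an AWS API):

```python
# Toy decision helper following the slide's two axes: data structure
# and query complexity. The mapping mirrors the matrix on slide 18.
CHOICES = {
    ("structured", "simple"): ["Amazon DynamoDB", "Amazon ElastiCache"],
    ("structured", "complex"): ["Amazon RDS", "Amazon Redshift", "Amazon CloudSearch"],
    ("unstructured", "none"): ["Amazon S3", "Amazon Glacier"],
    ("unstructured", "custom"): ["Amazon EMR (Hadoop/HDFS)"],
}

def suggest_store(structure, query):
    """Return candidate services for a (structure, query) combination."""
    return CHOICES.get((structure, query), ["(no match; refine requirements)"])

print(suggest_store("structured", "complex"))
```

Latency, volume, item size, and cost from the table above then narrow the candidates to one.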
21. Aggregate All Data in S3, Surrounded by a Collection of the Right Tools
Tools around S3: EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Cassandra, Storm.
S3:
• No limit on the number of objects
• Object size up to 5 TB
• Central data storage for all systems
• High bandwidth
• 99.999999999% durability
• Versioning, lifecycle policies
• Glacier integration
22. Amazon DynamoDB: fully managed NoSQL database service
• Built on solid-state drives (SSDs)
• Consistent low-latency performance
• Any throughput rate
• No storage limits
23. DynamoDB: Managed High Availability and Durability
• Scaling without downtime
• Automatic sharding
• Security inspections, patches, upgrades
• Automatic hardware failover
• Multi-AZ replication
• Hardware configuration designed specifically for DynamoDB
• Performance tuning
28. GPU Processing
CG1 instances:
• Intel Xeon X5570 processors
• 2 x NVIDIA Tesla M2050 GPUs
• CUDA, OpenCL
G2 instances:
• Intel Xeon E5-2670 processors
• 4 x NVIDIA GRID K520 GPUs
• Each GPU with 1,536 CUDA cores and 4 GB of video memory
• DirectX, OpenGL
29. Network Placement Groups
• Cluster instances deployed in a placement group enjoy low-latency, full-bisection 10 Gbps bandwidth
• Connect multiple placement groups to create very large clusters
• Enhanced networking with SR-IOV provides higher packet-per-second (PPS) performance, lower inter-instance latencies, and very low network jitter
30. Why AWS for HPC?
• Low cost with flexible pricing
• Efficient clusters
• Unlimited infrastructure
• Faster time to results
• Concurrent clusters on demand
• Increased collaboration
32. Popular HPC workloads on AWS
• Genome processing
• Modeling and simulation
• Government and educational research
• Monte Carlo simulations
• Transcoding and encoding
• Computational chemistry
33. Across several key industry verticals
• Utilities
• Biopharma
• Materials design
• Manufacturing
• Academic research
• Auto & aerospace
34. TOP500: 76th-fastest supercomputer, on demand
• June 2014 TOP500 list
• 484.2 TFlop/s (LINPACK benchmark)
• 26,496 cores in a cluster of EC2 C3 instances
• Intel Xeon E5-2680 v2 10-core 2.8 GHz processors
35. Three types of data-driven development
• Retrospective: batch analysis and reporting (Amazon Redshift, Amazon RDS, Amazon EMR)
• Here-and-now: stream/real-time processing and dashboards (Amazon Kinesis, Amazon EC2, AWS Lambda)
• Predictions: enabling smart applications (Amazon Machine Learning)
36. Amazon Redshift: columnar data warehouse
• ANSI SQL compatible
• Massively parallel
• Petabyte scale
• Fully managed
• Very cost-effective
37. Amazon Redshift architecture
• Leader node
  – SQL endpoint (JDBC/ODBC)
  – Stores metadata
  – Coordinates query execution
• Compute nodes
  – Local, columnar storage
  – Execute queries in parallel
  – Interconnected over a 10 GigE (HPC) network
  – Ingestion, load, backup, restore via Amazon S3
  – Parallel load from Amazon DynamoDB
• Hardware optimized for data processing
• Two hardware platforms
  – DW1: HDD; scales from 2 TB to 1.6 PB
  – DW2: SSD; scales from 160 GB to 256 TB
38. Amazon Redshift dramatically reduces I/O
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
With row storage you do unnecessary I/O: to get the total amount, you have to read every column of every row.
ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
39. Amazon Redshift dramatically reduces I/O
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
With column storage, you only read the data you need: just the Amount column for the same total.
ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
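The difference can be made concrete with a toy simulation of the table above, counting how many values each layout touches to total the Amount column (illustrative only; real Redshift I/O happens in large blocks):

```python
# Row vs. column storage: how many values are read to sum one column?
rows = [
    {"id": 123, "age": 20, "state": "CA", "amount": 500},
    {"id": 345, "age": 25, "state": "WA", "amount": 250},
    {"id": 678, "age": 40, "state": "FL", "amount": 125},
    {"id": 957, "age": 37, "state": "WA", "amount": 375},
]

# Row storage: every field of every row is fetched to total one column.
row_reads = sum(len(r) for r in rows)        # 16 values touched
total = sum(r["amount"] for r in rows)       # 1250

# Column storage: only the 'amount' column is fetched.
amount_column = [r["amount"] for r in rows]  # laid out contiguously on disk
col_reads = len(amount_column)               # 4 values touched
assert sum(amount_column) == total

print(total, row_reads, col_reads)  # 1250 16 4
```

With four columns, the columnar scan reads a quarter of the data; wide analytic tables make the gap far larger.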
40. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
Columnar compression saves space & reduces I/O; Amazon Redshift analyzes and compresses your data:
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
41. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
Zone maps track the minimum and maximum value for each block, so queries can skip over blocks that don't contain the data they need, minimizing unnecessary I/O.
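A zone map can be sketched in a few lines: keep (min, max) per block and scan only blocks whose range could contain the target (illustrative; block size and data are assumed):

```python
# Zone map sketch: per-block min/max lets a scan skip whole blocks.
BLOCK = 4
data = [12, 15, 11, 14,   51, 57, 50, 55,   92, 90, 95, 99]
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
zone_map = [(min(b), max(b)) for b in blocks]  # maintained at load time

def scan_equal(target):
    """Count matches, reading only blocks whose [min, max] range
    could contain the target. Returns (matches, blocks_scanned)."""
    scanned = matches = 0
    for (lo, hi), block in zip(zone_map, blocks):
        if lo <= target <= hi:
            scanned += 1
            matches += block.count(target)
    return matches, scanned

print(scan_equal(55))  # (1, 1): one match found, only 1 of 3 blocks read
```

The skipping only pays off when values are clustered within blocks, which is why Redshift sort keys matter.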
42. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
Direct-attached storage maximizes throughput on hardware optimized for high-performance data processing, and large block sizes make the most of each read. Amazon Redshift manages durability for you.
43. Amazon Elastic MapReduce: Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and spot pricing
• Tight integration with S3, DynamoDB, and Kinesis
44. How Does EMR Work?
1. Put the data into S3
2. Choose: Hadoop distribution, number of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
45. How Does EMR Work? (continued)
• You can easily resize the cluster
• You can launch parallel clusters using the same S3 data
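The work the cluster performs is classic MapReduce. As a purely local sketch (no EMR APIs; Hadoop distributes these stages across nodes), the canonical word count shows the map, shuffle, and reduce phases:

```python
# Single-process MapReduce word count: the same map -> shuffle -> reduce
# flow an EMR Hadoop cluster runs in parallel over HDFS/S3 data.
from collections import defaultdict

lines = ["big data on aws", "hpc on aws", "big clusters"]

# Map: emit (word, 1) for every word in every input line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values by key (Hadoop does this between nodes).
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce: aggregate each key's values.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["aws"], counts["big"])  # 2 2
```

Hive and Pig compile higher-level queries down to jobs with exactly this shape.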
48. Amazon Machine Learning
• Easy-to-use, managed machine learning service built for developers
• Robust, powerful machine learning technology based on Amazon's internal systems
• Create models using your data already stored in the AWS cloud
• Deploy models to production in seconds
49. Three Supported Types of Predictions
• Binary classification
  – Predict the answer to a yes/no question
• Multi-class classification
  – Predict the correct category from a list
• Regression
  – Predict the value of a numeric variable
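Toy, rule-based stand-ins (plain Python, not the Amazon Machine Learning API; every name and constant here is illustrative) show what each prediction type returns:

```python
# Each function returns the kind of value its prediction type produces.

def binary_is_spam(num_links):
    """Binary classification: a yes/no answer."""
    return num_links > 5  # hypothetical threshold rule

def multiclass_category(word):
    """Multi-class classification: one category from a fixed list."""
    table = {"goal": "sports", "ballot": "politics", "gpu": "tech"}
    return table.get(word, "other")

def regression_price(sqft):
    """Regression: a numeric value."""
    return 50.0 + 0.2 * sqft  # assumed toy linear model

print(binary_is_spam(7), multiclass_category("gpu"), regression_price(1000))
```

A real Amazon ML model learns these mappings from training data instead of hand-written rules, but the output shapes are the same: a boolean, a label, or a number.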