Elastic storage and compute services provide a firm foundation on which to build systems to drive value from data.
This presentation discusses how to run analytics pipelines on the AWS Cloud, from data storage with S3 and DynamoDB, to high-scale computation with Elastic MapReduce and Cluster Compute instances on EC2.
67. Live data in DynamoDB
-- Hive external table backed by a live DynamoDB table via the EMR DynamoDB
-- storage handler; no data is copied — queries read DynamoDB directly.
-- FIX: in the slide transcript the handler class name and the column-mapping
-- value were split across lines (literal newlines inside the quoted strings);
-- rejoined here so the DDL parses and the mapping matches the DynamoDB
-- attribute names ("Order ID", "Customer ID", "Order Date", "Total").
CREATE EXTERNAL TABLE orders_ddb_2012_01 (
    order_id    string,
    customer_id string,
    order_date  bigint,  -- epoch seconds (compared with unix_timestamp() downstream)
    total       double
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name" = "Orders-2012-01",
    "dynamodb.column.mapping" =
        "order_id:Order ID,customer_id:Customer ID,order_date:Order Date,total:Total"
);
68. Query DynamoDB
-- Top 5 customers by spend for the first week of January 2012, read live
-- from the DynamoDB-backed table. The half-open range [Jan 1, Jan 8) avoids
-- double-counting the boundary day.
-- FIX: the 'yyyy-MM-dd' format strings were split across lines in the
-- transcript; rejoined so unix_timestamp() receives a valid pattern.
SELECT
    customer_id,
    SUM(total) AS spend,
    COUNT(*)   AS order_count
FROM orders_ddb_2012_01
WHERE order_date >= unix_timestamp('2012-01-01', 'yyyy-MM-dd')
  AND order_date <  unix_timestamp('2012-01-08', 'yyyy-MM-dd')
GROUP BY customer_id
ORDER BY spend DESC
LIMIT 5;
69. Archived data in S3
-- External table over the archived order data in S3 (delimited text),
-- partitioned by year/month so queries can prune partitions.
-- FIX: the transcript shows FIELDS TERMINATED BY 't' — the backslash was
-- lost in extraction. A literal 't' delimiter would shred every row; the
-- sample dataset is tab-separated, so the delimiter must be '\t'.
CREATE EXTERNAL TABLE orders_s3_export (
    order_id    string,
    customer_id string,
    order_date  int,
    total       double
)
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://elastic-mapreduce/samples/ddb-orders';
70. Query S3
-- Monthly spend for one customer from the S3 archive, June 2011 onward.
-- FIX: year/month are declared as string partition columns, but the original
-- compared them to integer literals; the implicit cast can defeat partition
-- pruning in Hive. Compare as strings instead.
-- NOTE(review): assumes months are zero-padded two-digit strings ('01'..'12'),
-- as in the sample dataset's partition layout — confirm before relying on
-- the lexical '>=' comparison.
SELECT
    year,
    month,
    customer_id,
    SUM(total) AS spend,
    COUNT(*)   AS order_count
FROM orders_s3_export
WHERE customer_id = 'c-2cC5fF1bB'
  AND year  = '2011'
  AND month >= '06'
GROUP BY customer_id, year, month
ORDER BY month DESC;
71. Export to S3
-- Export one month of DynamoDB orders to a comma-delimited copy in S3.
-- FIX: the transcript's LOCATION was the bare placeholder 's3://', which is
-- not a valid table location — it must point at a writable bucket/prefix.
-- NOTE(review): order_date is int here but bigint in orders_ddb_2012_01;
-- Hive will narrow on insert — confirm the epoch values fit in 32 bits.
CREATE EXTERNAL TABLE orders_s3_new_export (
    order_id    string,
    customer_id string,
    order_date  int,
    total       double
)
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/orders-export/';  -- TODO: replace with a bucket you own

-- Write the DynamoDB-backed table into the 2012-01 partition.
-- Explicit column list instead of SELECT * so a schema change in the source
-- table fails loudly rather than silently shifting columns.
INSERT OVERWRITE TABLE orders_s3_new_export
PARTITION (year = '2012', month = '01')
SELECT order_id, customer_id, order_date, total
FROM orders_ddb_2012_01;