4. Routine work
• Data processing
• Data mining (Marketing)
• Machine learning (Matching)
• Item-Item: Users who liked this item also liked
• User-Item: Users who are similar to you also liked
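The two matching styles above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the talk: the `LIKES` data, the function names, and the choice of cosine similarity (item-item) and Jaccard similarity (user-item) are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical toy data: user -> set of liked items (not from the slides).
LIKES = {
    "ann": {"A", "B", "C"},
    "bob": {"A", "B"},
    "cat": {"B", "C", "D"},
    "dan": {"A", "D"},
}

def item_item_similar(item, likes):
    """Item-Item: rank other items by cosine similarity of their liker sets."""
    fans = {}  # item -> set of users who liked it
    for user, items in likes.items():
        for it in items:
            fans.setdefault(it, set()).add(user)
    target = fans.get(item, set())
    scores = {}
    for other, users in fans.items():
        if other == item or not target:
            continue
        overlap = len(target & users)
        if overlap:
            scores[other] = overlap / sqrt(len(target) * len(users))
    # Highest score first; ties broken by item name for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))

def user_item_recommend(user, likes):
    """User-Item: score unseen items by the Jaccard similarity of their likers to `user`."""
    mine = likes[user]
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        union = mine | items
        sim = len(mine & items) / len(union) if union else 0.0
        for it in items - mine:
            scores[it] = scores.get(it, 0.0) + sim
    ranked = [i for i in scores if scores[i] > 0]
    return sorted(ranked, key=lambda i: (-scores[i], i))
```

With the toy data, `item_item_similar("A", LIKES)` ranks `B` first because two of the three users who liked `A` also liked `B`, and `user_item_recommend("bob", LIKES)` ranks `C` ahead of `D` because the users most similar to `bob` liked `C`.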
Captain monkey - Sean.Chang
14. Fully controllable or not
                    Amazon Kinesis           Apache Kafka
Unit                Stream                   Topic
Distribution        Shards                   Partitions
Throughput          2 MB/s read per shard    Based on cluster size
                    1 MB/s write per shard
Fault tolerance     Handled by AWS           Replication
Transformer         AWS Lambda               Connectors/Processors
Framework support   Spark and Flink          Spark and Flink
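As a rough sizing sketch, the per-shard throughput figures in the comparison above can be turned into a shard count. The function name and the example traffic numbers are hypothetical, the limits are read as MB per second, and other Kinesis limits (such as records per second) are ignored here.

```python
from math import ceil

# Per-shard limits taken from the comparison above, read as MB per second.
WRITE_MB_PER_SHARD = 1.0
READ_MB_PER_SHARD = 2.0

def shards_needed(write_mb_s, read_mb_s):
    """Return the smallest shard count that covers both write and read traffic."""
    by_write = ceil(write_mb_s / WRITE_MB_PER_SHARD)
    by_read = ceil(read_mb_s / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write, by_read)
```

For example, 5 MB/s of writes and 15 MB/s of total reads need 8 shards, because reads are the bottleneck. With Kafka there is no fixed per-partition figure to plug in, since throughput depends on the cluster size.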
16. Cost / Performance
              Amazon S3                      Amazon EMR
Type          Objects                        Block device
Throughput    Medium                         High
Cost          0.025 USD/GB                   0.12 x 3 USD/GB (EBS gp2)
Maintenance   No (Policy / Lifecycle)        Yes
Libraries     hadoop-aws (3.x.x is better)   ALL
Storing Apache Hadoop Data on the Cloud - HDFS vs. S3
Top 5 Reasons for Choosing S3 over HDFS
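The cost row of the S3 / EMR comparison can be worked through as simple arithmetic. This sketch assumes the "x3" on the EMR side is the default HDFS replication factor of 3 and that both prices cover the same billing period (the slide does not state it); the function name is hypothetical.

```python
# Prices as quoted on the slide, in USD per GB.
S3_USD_PER_GB = 0.025
EBS_GP2_USD_PER_GB = 0.12
# Assumption: "x3" is the HDFS replication factor.
HDFS_REPLICATION = 3

def storage_cost(data_gb):
    """Return (s3_cost, hdfs_on_ebs_cost) in USD for `data_gb` of logical data."""
    s3 = data_gb * S3_USD_PER_GB
    hdfs = data_gb * EBS_GP2_USD_PER_GB * HDFS_REPLICATION
    return round(s3, 2), round(hdfs, 2)
```

For 1,000 GB of data this gives 25 USD on S3 against 360 USD on replicated EBS volumes, roughly a 14x difference before compute and request costs are considered.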
29.
• “XXX as a Service” first - why not?
• Fully controllable or not
• Cost / Performance (pay-as-you-go)
• The computing environment depends on the scenario.