The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But which services should you use, and why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, and durability. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
2. Three Types of Data Analytics
• Retrospective: analysis and reporting
• Here-and-now: real-time processing and dashboards
• Predictions: to enable smart apps
6. Fluentd: Open Source Log Collection
• Fluentd is an open source data collector that unifies data collection and consumption
• Integrates with many data sources (app logs, syslog, Twitter, etc.)
• Direct integration with AWS services such as S3 and Kinesis
<source>
  type tail                           # follow the file like `tail -f`
  format apache2                      # parse lines as Apache access-log entries
  path /var/log/apache2/access_log
  tag s3.apache.access                # route these events to the S3 output below
</source>

<match s3.*.*>
  type s3                             # fluent-plugin-s3 output
  s3_bucket myweblogs
  path logs/                          # key prefix inside the bucket
</match>
8. Amazon S3
• Highly available object storage designed
for 99.999999999% data durability
• Replicated across 3 facilities
• Virtually unlimited scale
• Pay only for what you use; no need to pre-provision
• Allows event notifications to trigger
further action
• Ideal for a data lake (single source of truth)
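Landing a log file in the data lake is a single put; there is no capacity to provision first. A minimal sketch using the AWS SDK for Java v1 from Scala (bucket, key, and file paths are illustrative assumptions):

import com.amazonaws.services.s3.AmazonS3ClientBuilder

object S3Sketch extends App {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  // One call stores the object durably; an S3 event notification on the
  // bucket can then trigger further processing downstream.
  s3.putObject("myweblogs", "logs/2016-06-01/access.log.gz",
               new java.io.File("/var/log/apache2/access_log.gz"))
}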
10. Amazon DynamoDB
• Schemaless Data Model
• Seamless scalability
• No storage or throughput limits
• Consistent low latency performance
• High durability and availability
• Replicated across 3 facilities
Fully Managed NoSQL Database Service
(Diagram: a DynamoDB table contains items, which contain attributes.)
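Because the data model is schemaless, writing an item just means supplying a map of attributes; items in the same table can carry different attributes. A minimal sketch with the AWS SDK for Java v1 from Scala (the "events" table and its attributes are assumptions for illustration):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import scala.collection.JavaConverters._

object DynamoSketch extends App {
  val ddb = AmazonDynamoDBClientBuilder.defaultClient()
  // An item is a map of attribute names to typed values -- no table-wide
  // schema change is needed to introduce a new attribute.
  val item = Map(
    "userId"    -> new AttributeValue().withS("user-42"),
    "eventTime" -> new AttributeValue().withN(System.currentTimeMillis.toString),
    "payload"   -> new AttributeValue().withS("""{"action":"checkout"}""")
  ).asJava
  ddb.putItem("events", item)
}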
14. Stream in Real Time: Amazon Kinesis
• Real-Time Data Processing over
large distributed streams
• Elastic capacity that scales to
millions of events per second
• React in real time to incoming stream events
• Reliable stream storage
replicated across 3 facilities
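Producers write records with a partition key that determines the shard, and consumers react to the events as they arrive. A minimal producer sketch using the AWS SDK for Java v1 from Scala (the "clickstream" stream name is an assumption):

import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest

object KinesisProducerSketch extends App {
  val kinesis = AmazonKinesisClientBuilder.defaultClient()
  kinesis.putRecord(new PutRecordRequest()
    .withStreamName("clickstream")
    .withPartitionKey("user-42")   // same key -> same shard, preserving per-key order
    .withData(ByteBuffer.wrap("""{"event":"page_view"}""".getBytes("UTF-8"))))
}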
17. Amazon EMR
• Amazon EMR is a fully managed
Hadoop cluster
• Transient and long-running clusters
• Direct integration with Amazon S3 and Amazon Kinesis
• Easy to scale, enabling burstable capacity
• Integration with the EC2 Spot market
18. 1 instance x 100 hours = 100 instances x 1 hour
(and with Spot Pricing not only faster but also cheaper)
19. Process – Amazon EMR
• Amazon EMR supports all common Hadoop frameworks, such as:
• Spark, Pig, Hive, Hue, Oozie …
• HBase, Presto, Impala …
• Decouples storage from compute
• Allows independent scaling
• Direct integration with DynamoDB and S3 (EMRFS), as sketched below
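Because EMRFS exposes S3 paths to Spark and Hadoop as a filesystem, jobs read from and write back to the data lake directly, so the cluster can be resized or terminated independently of storage. A sketch of a Spark job on EMR, assuming sc is the cluster's SparkContext and the bucket names are illustrative:

val logs   = sc.textFile("s3://myweblogs/logs/")   // read straight from S3 via EMRFS
val errors = logs.filter(_.contains(" 500 "))      // keep only server errors
errors.saveAsTextFile("s3://myweblogs/errors/")    // results land back in S3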
20. • FINRA regulates trading practices of
brokerage firms and exchange markets to
protect market integrity
• Market surveillance platform stores
30 billion market events every day
• Leverages Amazon S3 to store events, letting analysts interactively query market dynamics using Amazon EMR Hive & HBase clusters with increased agility
Re-Architecting Compliance
(Callouts: unlimited storage • distributed computing • interactive market queries • ensure compliance • 30 billion market events)
21. Apache Spark
• Apache Spark is an in-memory analytics engine that uses RDDs (Resilient Distributed Datasets) for fast processing
• Faster than MapReduce because intermediate results stay in memory instead of being shuffled out to HDFS between phases
• Apache Spark Streaming can read directly from DynamoDB, S3, and Kinesis streams
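Much of the speedup comes from reuse: an RDD can be cached in memory once and scanned by several actions, instead of each MapReduce stage re-reading from disk. A small sketch (the bucket name is an assumption):

val events = sc.textFile("s3://myweblogs/logs/").cache() // materialized in memory on first action
val total  = events.count()                              // first pass builds the cache
val errors = events.filter(_.contains(" 500 ")).count()  // second pass reuses cached partitions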
22. Processing Amazon Kinesis streams
(Diagram: Amazon Kinesis → EMR with Spark Streaming)
KinesisUtils.createStream(ssc, "tweet-counter", "twitter-stream",  // ssc: an existing StreamingContext
    "kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)
  .map(bytes => new String(bytes))        // records arrive as Array[Byte]
  .filter(_.contains("Big Data"))
  .countByWindow(Seconds(5), Seconds(5))  // sliding five-second tweet count
Counting tweets on a sliding window
24. Amazon Redshift
• Fully managed petabyte-scale data
warehouse
• Scalable amount of cluster nodes
• ODBC/JDBC connector for BI tools
using SQL
• Loads data directly from Amazon DynamoDB and Amazon S3
• Less than a tenth of the cost of traditional solutions
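Loading typically goes through the COPY command, which pulls files from S3 in parallel across the cluster's slices; over JDBC it is a single statement. A sketch from Scala, assuming the Redshift JDBC driver is on the classpath (endpoint, credentials, IAM role, table, and bucket are all placeholders):

import java.sql.DriverManager

object RedshiftLoadSketch extends App {
  val url  = "jdbc:redshift://examplecluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev"
  val conn = DriverManager.getConnection(url, "masteruser", "Password123")
  try {
    // COPY fans the S3 files out across the cluster's nodes in parallel.
    conn.createStatement().execute(
      """COPY weblogs FROM 's3://myweblogs/logs/'
        |IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        |FORMAT AS JSON 'auto'""".stripMargin)
  } finally conn.close()
}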
26. AWS Marketplace
• Pre-Configured machine images
ready to be launched into virtual
server instances
• Launch applications with 1-Click
• Pay for software licenses by the hour, or bring your own license (BYOL)
28. InfoCluster
InfoForce is a cloud-based big data solution provider specializing in near-real-time, high-throughput analytics and machine learning algorithms.
(Diagram: organization data lakes connect to InfoForce.co through an application server via REST; data pipes feed analytics and machine learning.)
30. We reviewed 1,800+ merchants (3.5 million SKUs in total, US $80M transaction volume in the last 90 days) and found that over 86% of listings had no sales in the last 60 days.*

Exposure tier           Share of listings
High exposure           1.9%
Mid-high exposure       3.4%
Mid-low exposure        2.8%
New listed in 7 days    0.6%
New listed in 30 days   5%
Low exposure            86.3%
Total                   100%

Definitions:
1. High exposure: sold in the last 7 days
2. Mid-high exposure: sold in the last 30 days but not in the last 7 days
3. Mid-low exposure: sold in the last 60 days but not in the last 30 days
4. Low exposure: no sales history in the last 60 days
Understand the e-commerce market
31. Instead of setting a fixed selling price, Price Optimizer offers a DYNAMIC pricing strategy: it analyzes each product's sales cycle and compares it with similar products' sales cycles in the market to design price rules for each item.
(Callouts: 6050 strategies • product similarity • automatically managed listings)
1. Price Optimization
32. To discover relationships between products, InfoForce uses a textual-analysis engine that derives relationships from catalog metadata and learns from consumption patterns (a hypothetical sketch of the tagging step follows below).
Example: "Flora By Gucci Eau De Parfum Spray 50ml" is tagged as gucci, flora, eau_de_parfum_spray, 50ml (brand, model, product attributes). Related items by tag overlap:
• Gucci, eau_de_toilette
• Gucci, flora, eau_de_parfum, 75ml
• Vera_Wang, floral, eau_de_parfum_spray, 50ml
2. Cross-sell Bundler
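A hypothetical sketch of that tag-extraction step: collapse known multi-word attributes into single tokens, then drop stop words. The word lists and rules are illustrative, not InfoForce's actual engine:

object TitleTagger extends App {
  val compounds = Seq("eau de parfum spray", "eau de parfum", "eau de toilette")
  val stopWords = Set("by")

  def tags(title: String): Seq[String] = {
    // Collapse multi-word attributes: "eau de parfum spray" -> "eau_de_parfum_spray"
    val collapsed = compounds.foldLeft(title.toLowerCase) {
      (t, c) => t.replace(c, c.replace(' ', '_'))
    }
    // Split on whitespace and drop stop words.
    collapsed.split("\\s+").filterNot(stopWords).toSeq
  }

  println(tags("Flora By Gucci Eau De Parfum Spray 50ml"))
  // -> flora, gucci, eau_de_parfum_spray, 50ml
}

Items that share tags (brand, attribute, size) can then be scored for similarity and bundled.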
36. Challenges
• Scalability: TBs of data; compute- and IO-heavy; must grow dynamically
• Timeliness: streaming data; results need to be available FAST; always available
• Secure: privacy; fine-grained control over resources; persistence & backup
37. Challenge No More
Amazon EC2
A solid platform to scale our compute capabilities. We leverage AWS IAM to grant API-level access to spin up additional machines when necessary.
Amazon Kinesis
Core to our architecture are the real-time data pipes, built on Amazon Kinesis, where we can provision the number of shards based on throughput requirements without worrying about the complexities of running a production-grade streaming service.
Amazon DynamoDB
Serves as a reliable way to persist important metadata such as checkpointing and stream schema info.
Amazon Kinesis Firehose
Firehose archives any data received from the data pipes for long-term batch and real-time analytics by the calculation cluster.
Amazon S3 and CloudFront
S3 persists data and has handy integrations with Apache Spark for easy distributed computing. CloudFront serves as a CDN for fast raw-data delivery into apps.
Amazon CloudWatch
InfoCluster uses built-in and custom CloudWatch metrics to store and monitor service health; the alarm functionality notifies the operations team of any issues.
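As a concrete example of the monitoring piece, a custom metric can be published from any process and alarmed on. A sketch with the AWS SDK for Java v1 from Scala (the namespace, metric name, and value are placeholders):

import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest, StandardUnit}

object MetricSketch extends App {
  val cw = AmazonCloudWatchClientBuilder.defaultClient()
  cw.putMetricData(new PutMetricDataRequest()
    .withNamespace("InfoCluster/Pipes")   // custom namespace for pipeline metrics
    .withMetricData(new MetricDatum()
      .withMetricName("RecordsProcessed")
      .withUnit(StandardUnit.Count)
      .withValue(1250.0)))                // a CloudWatch alarm on this metric can notify ops
}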
41. Price Optimizer
• >700,000 items actively managed
• Generated >US $1.4M revenue for merchants, with 51,014 units sold
Cross-Sell Bundler
• Successful integration with word-tagging infrastructure
• Soft launch in early June 2016
• Generates independent bundling results in 30 seconds for 2,000 items, with 10 bundle recommendations each
Winner of HK ICT Awards 2016
• Best Startup
• Best Smart Hong Kong (Big Data Application)
Results
42. A merchant provided 50K listings to adopt our price optimizer; about 2% of them (around 1,000 listings) were re-activated and generated over US $80K GMV in 20 days.
Successful Case
43. • Business
• Look for applications of the technology in other domains / industries
• Plug-in / API-level access to InfoCluster's functionality
• Further investment in machine learning
• Technology
• Storage optimization using ST1 and SC1 EBS volumes
• Optimizing the balance between continuously running resources and on-demand compute services such as AWS Lambda
• Full integration of custom processes with Amazon CloudWatch
• Pub/Sub notification framework
Future Plan
44. • We have progressed far beyond justifying the cloud on cost alone...
• Builders like to build
• Unlock your organization's data lakes; use InfoForce (preferably!)
Summary
45. “We see our customers as
invited guests to a party,
and we are the hosts. It’s
our job to make every
important aspect of the
customer experience
a little bit better.”
Jeff Bezos
CEO, Amazon.com
46. Data analysis for a better customer experience
• Your business creates and stores
data and logs all the time
• Data points and logs allow you to
understand individual customer
experience and improve it
• Analysis of logs and trails helps you gain insights
47. Big Data:
• Massive Datasets
• Experimental style of data
manipulation and analysis
• Not a steady-state workload;
peaks and valleys
• Combination of structured and
unstructured data in many
formats
AWS Cloud:
• Virtually unlimited capacity
• Low-cost experimentation through on-demand infrastructure
• Scalable infrastructure for highly variable workloads
• Tools & services for managing structured, unstructured, and streaming data