The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But which services should you use, and why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, and durability. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
2. Three Types of Data Analytics
• Retrospective: analysis and reporting
• Here-and-now: real-time processing and dashboards
• Predictions: to enable smart apps
6. Fluentd: Open Source Log Collection
• Fluentd is an open source data collector that unifies data collection and consumption
• Integrates with many data sources (app logs, syslog, Twitter, etc.)
• Direct integration with AWS services such as S3 and Kinesis
<source>
  type tail                           # follow the file like `tail -f`
  format apache2                      # parse lines as Apache access-log entries
  path /var/log/apache2/access_log
  tag s3.apache.access                # route these events to the S3 output below
</source>

<match s3.*.*>
  type s3                             # fluent-plugin-s3 output
  s3_bucket myweblogs
  path logs/                          # key prefix inside the bucket
</match>
8. Amazon S3
• Highly available object storage designed
for 99.999999999% data durability
• Replicated across 3 facilities
• Virtually unlimited scale
• Pay only for what you use; no need to pre-provision
• Allows event notifications to trigger
further action
• Ideal for a data lake (single source of truth)
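Landing a log file in the data lake is a single put; there is no capacity to provision first. A minimal sketch using the AWS SDK for Java v1 from Scala (bucket, key, and file paths are illustrative assumptions):

import com.amazonaws.services.s3.AmazonS3ClientBuilder

object S3Sketch extends App {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  // One call stores the object durably; an S3 event notification on the
  // bucket can then trigger further processing downstream.
  s3.putObject("myweblogs", "logs/2016-06-01/access.log.gz",
               new java.io.File("/var/log/apache2/access_log.gz"))
}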
10. Amazon DynamoDB
• Schemaless Data Model
• Seamless scalability
• No storage or throughput limits
• Consistent low latency performance
• High durability and availability
• Replicated across 3 facilities
Fully Managed NoSQL Database Service
(Diagram: a DynamoDB table contains items, which contain attributes.)
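Because the data model is schemaless, writing an item just means supplying a map of attributes; items in the same table can carry different attributes. A minimal sketch with the AWS SDK for Java v1 from Scala (the "events" table and its attributes are assumptions for illustration):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import scala.collection.JavaConverters._

object DynamoSketch extends App {
  val ddb = AmazonDynamoDBClientBuilder.defaultClient()
  // An item is a map of attribute names to typed values -- no table-wide
  // schema change is needed to introduce a new attribute.
  val item = Map(
    "userId"    -> new AttributeValue().withS("user-42"),
    "eventTime" -> new AttributeValue().withN(System.currentTimeMillis.toString),
    "payload"   -> new AttributeValue().withS("""{"action":"checkout"}""")
  ).asJava
  ddb.putItem("events", item)
}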
14. Stream in Real Time: Amazon Kinesis
• Real-Time Data Processing over
large distributed streams
• Elastic capacity that scales to
millions of events per second
• React in real time to incoming stream events
• Reliable stream storage
replicated across 3 facilities
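Producers write records with a partition key that determines the shard, and consumers react to the events as they arrive. A minimal producer sketch using the AWS SDK for Java v1 from Scala (the "clickstream" stream name is an assumption):

import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest

object KinesisProducerSketch extends App {
  val kinesis = AmazonKinesisClientBuilder.defaultClient()
  kinesis.putRecord(new PutRecordRequest()
    .withStreamName("clickstream")
    .withPartitionKey("user-42")   // same key -> same shard, preserving per-key order
    .withData(ByteBuffer.wrap("""{"event":"page_view"}""".getBytes("UTF-8"))))
}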
17. Amazon EMR
• Amazon EMR is a fully managed
Hadoop cluster
• Transient and long-running clusters
• Direct integration with Amazon S3 and Amazon Kinesis
• Easy to scale, enabling burstable capacity
• Integration with the EC2 Spot market
18. 1 instance x 100 hours = 100 instances x 1 hour
(and with Spot Pricing not only faster but also cheaper)
19. Process – Amazon EMR
• Amazon EMR supports all common Hadoop frameworks, such as:
• Spark, Pig, Hive, Hue, Oozie …
• HBase, Presto, Impala …
• Decouples storage from compute
• Allows independent scaling
• Direct integration with DynamoDB and S3 (EMRFS), as sketched below
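Because EMRFS exposes S3 paths to Spark and Hadoop as a filesystem, jobs read from and write back to the data lake directly, so the cluster can be resized or terminated independently of storage. A sketch of a Spark job on EMR, assuming sc is the cluster's SparkContext and the bucket names are illustrative:

val logs   = sc.textFile("s3://myweblogs/logs/")   // read straight from S3 via EMRFS
val errors = logs.filter(_.contains(" 500 "))      // keep only server errors
errors.saveAsTextFile("s3://myweblogs/errors/")    // results land back in S3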
20. • FINRA regulates trading practices of
brokerage firms and exchange markets to
protect market integrity
• Market surveillance platform stores
30 billion market events every day
• Leverages Amazon S3 to store events, letting analysts interactively query market dynamics using Amazon EMR Hive & HBase clusters with increased agility
Re-Architecting Compliance
(Callouts: unlimited storage • distributed computing • interactive market queries • ensure compliance • 30 billion market events)
21. Apache Spark
• Apache Spark is an in-memory analytics engine that uses RDDs (Resilient Distributed Datasets) for fast processing
• Faster than MapReduce because intermediate results stay in memory instead of being shuffled out to HDFS between phases
• Apache Spark Streaming can read directly from DynamoDB, S3, and Kinesis streams
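Much of the speedup comes from reuse: an RDD can be cached in memory once and scanned by several actions, instead of each MapReduce stage re-reading from disk. A small sketch (the bucket name is an assumption):

val events = sc.textFile("s3://myweblogs/logs/").cache() // materialized in memory on first action
val total  = events.count()                              // first pass builds the cache
val errors = events.filter(_.contains(" 500 ")).count()  // second pass reuses cached partitions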
22. Processing Amazon Kinesis streams
(Diagram: Amazon Kinesis → EMR with Spark Streaming)
KinesisUtils.createStream(ssc, "tweet-counter", "twitter-stream",  // ssc: an existing StreamingContext
    "kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)
  .map(bytes => new String(bytes))        // records arrive as Array[Byte]
  .filter(_.contains("Big Data"))
  .countByWindow(Seconds(5), Seconds(5))  // sliding five-second tweet count
Counting tweets on a sliding window
24. Amazon Redshift
• Fully managed petabyte-scale data
warehouse
• Scalable amount of cluster nodes
• ODBC/JDBC connector for BI tools
using SQL
• Loads data directly from Amazon DynamoDB and Amazon S3
• Less than a tenth of the cost of traditional solutions
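Loading typically goes through the COPY command, which pulls files from S3 in parallel across the cluster's slices; over JDBC it is a single statement. A sketch from Scala, assuming the Redshift JDBC driver is on the classpath (endpoint, credentials, IAM role, table, and bucket are all placeholders):

import java.sql.DriverManager

object RedshiftLoadSketch extends App {
  val url  = "jdbc:redshift://examplecluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev"
  val conn = DriverManager.getConnection(url, "masteruser", "Password123")
  try {
    // COPY fans the S3 files out across the cluster's nodes in parallel.
    conn.createStatement().execute(
      """COPY weblogs FROM 's3://myweblogs/logs/'
        |IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        |FORMAT AS JSON 'auto'""".stripMargin)
  } finally conn.close()
}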
26. AWS Marketplace
• Pre-Configured machine images
ready to be launched into virtual
server instances
• Launch applications with 1-Click
• Pay for software licenses by the hour, or bring your own license (BYOL)
28. InfoCluster
InfoForce is a cloud-based big data solution provider specializing in near-real-time, high-throughput analytics and machine learning algorithms.
(Diagram: organization data lakes connect to InfoForce.co through an application server via REST; data pipes feed analytics and machine learning.)
30. We reviewed 1,800+ merchants (3.5 million SKUs in total, US $80M transaction volume in the last 90 days) and found that over 86% of listings had no sales in the last 60 days.*

Exposure tier           Share of listings
High exposure           1.9%
Mid-high exposure       3.4%
Mid-low exposure        2.8%
New listed in 7 days    0.6%
New listed in 30 days   5%
Low exposure            86.3%
Total                   100%

Definitions:
1. High exposure: sold in the last 7 days
2. Mid-high exposure: sold in the last 30 days but not in the last 7 days
3. Mid-low exposure: sold in the last 60 days but not in the last 30 days
4. Low exposure: no sales history in the last 60 days
Understand the e-commerce market
31. Instead of setting a fixed selling price, Price Optimizer offers a DYNAMIC pricing strategy: it analyzes each product's sales cycle and compares it with similar products' sales cycles in the market to design price rules for each item.
(Callouts: 6050 strategies • product similarity • automatically managed listings)
1. Price Optimization
32. To discover relationships between products, InfoForce uses a textual-analysis engine that derives relationships from catalog metadata and learns from consumption patterns (a hypothetical sketch of the tagging step follows below).
Example: "Flora By Gucci Eau De Parfum Spray 50ml" is tagged as gucci, flora, eau_de_parfum_spray, 50ml (brand, model, product attributes). Related items by tag overlap:
• Gucci, eau_de_toilette
• Gucci, flora, eau_de_parfum, 75ml
• Vera_Wang, floral, eau_de_parfum_spray, 50ml
2. Cross-sell Bundler
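A hypothetical sketch of that tag-extraction step: collapse known multi-word attributes into single tokens, then drop stop words. The word lists and rules are illustrative, not InfoForce's actual engine:

object TitleTagger extends App {
  val compounds = Seq("eau de parfum spray", "eau de parfum", "eau de toilette")
  val stopWords = Set("by")

  def tags(title: String): Seq[String] = {
    // Collapse multi-word attributes: "eau de parfum spray" -> "eau_de_parfum_spray"
    val collapsed = compounds.foldLeft(title.toLowerCase) {
      (t, c) => t.replace(c, c.replace(' ', '_'))
    }
    // Split on whitespace and drop stop words.
    collapsed.split("\\s+").filterNot(stopWords).toSeq
  }

  println(tags("Flora By Gucci Eau De Parfum Spray 50ml"))
  // -> flora, gucci, eau_de_parfum_spray, 50ml
}

Items that share tags (brand, attribute, size) can then be scored for similarity and bundled.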
36. Challenges
• Scalability: TBs of data; compute- and IO-heavy; must grow dynamically
• Timeliness: streaming data; results need to be available FAST; always available
• Secure: privacy; fine-grained control over resources; persistence & backup
37. Challenge No More
Amazon EC2
A solid platform to scale our compute capabilities. We leverage AWS IAM to grant API-level access to spin up additional machines when necessary.
Amazon Kinesis
Core to our architecture are the real-time data pipes, built on Amazon Kinesis, where we can provision the number of shards based on throughput requirements without worrying about the complexities of running a production-grade streaming service.
Amazon DynamoDB
Serves as a reliable way to persist important metadata such as checkpointing and stream schema info.
Amazon Kinesis Firehose
Firehose archives any data received from the data pipes for long-term batch and real-time analytics by the calculation cluster.
Amazon S3 and CloudFront
S3 persists data and has handy integrations with Apache Spark for easy distributed computing. CloudFront serves as a CDN for fast raw-data delivery into apps.
Amazon CloudWatch
InfoCluster uses built-in and custom CloudWatch metrics to store and monitor service health; the alarm functionality notifies the operations team of any issues.
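As a concrete example of the monitoring piece, a custom metric can be published from any process and alarmed on. A sketch with the AWS SDK for Java v1 from Scala (the namespace, metric name, and value are placeholders):

import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest, StandardUnit}

object MetricSketch extends App {
  val cw = AmazonCloudWatchClientBuilder.defaultClient()
  cw.putMetricData(new PutMetricDataRequest()
    .withNamespace("InfoCluster/Pipes")   // custom namespace for pipeline metrics
    .withMetricData(new MetricDatum()
      .withMetricName("RecordsProcessed")
      .withUnit(StandardUnit.Count)
      .withValue(1250.0)))                // a CloudWatch alarm on this metric can notify ops
}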
41. Price Optimizer
• >700,000 items actively managed
• Generated >US $1.4M revenue for merchants, with 51,014 units sold
Cross-Sell Bundler
• Successful integration with word-tagging infrastructure
• Soft launch in early June 2016
• Generates independent bundling results in 30 seconds for 2,000 items, with 10 bundle recommendations each
Winner of HK ICT Awards 2016
• Best Startup
• Best Smart Hong Kong (Big Data Application)
Results
42. A merchant provided 50K listings to adopt our price optimizer; about 2% of them (around 1,000 listings) were re-activated and generated over US $80K GMV in 20 days.
Successful Case
43. • Business
• Look for applications of the technology in other domains / industries
• Plug-in / API-level access to InfoCluster's functionality
• Further investment in machine learning
• Technology
• Storage optimization using ST1 and SC1 EBS volumes
• Optimizing the balance between continuously running resources and on-demand compute services such as AWS Lambda
• Full integration of custom processes with Amazon CloudWatch
• Pub/Sub notification framework
Future Plan
44. • We have progressed far beyond justifying the cloud on cost alone...
• Builders like to build
• Unlock your organization's data lakes; use InfoForce (preferably!)
Summary
45. “We see our customers as
invited guests to a party,
and we are the hosts. It’s
our job to make every
important aspect of the
customer experience
a little bit better.”
Jeff Bezos
CEO, Amazon.com
46. Data analysis for a better customer experience
• Your business creates and stores
data and logs all the time
• Data points and logs allow you to
understand individual customer
experience and improve it
• Analysis of logs and trails helps you gain insights
47. Big Data:
• Massive Datasets
• Experimental style of data
manipulation and analysis
• Not a steady-state workload;
peaks and valleys
• Combination of structured and
unstructured data in many
formats
AWS Cloud:
• Virtually unlimited capacity
• Low-cost experimentation through on-demand infrastructure
• Scalable infrastructure for highly variable workloads
• Tools & services for managing structured, unstructured, and streaming data