Storing, accessing, and analyzing large amounts of data from diverse sources, and making the results easily accessible as actionable insights, can be challenging for data-driven organizations. The answer is to optimize scaling and provide a unified interface that simplifies analysis. Qubole helps customers simplify their Big Data analytics with speed and scalability, while giving data analysts and data scientists self-service access on the AWS Cloud. Join Qubole and AWS to discuss how Auto Scaling and Amazon EC2 Spot pricing can help customers efficiently turn data into insights. We'll also cover best practices for migrating from an on-premises Big Data architecture to the AWS Cloud.
Join us to learn:
• How to more easily create elastic Hadoop, Spark, and other Big Data clusters for dynamic, large-scale workloads
• Best practices for using Auto Scaling and Amazon EC2 Spot Instances to cost-optimize Big Data workloads
• Best practices for deploying or migrating Big Data workloads to the AWS Cloud
Who should attend: IT Administrators, IT Architects, Data Warehouse Developers, Database Administrators, Business Analysts, and Data Architects
3. Data is growing
• 1.7 MB of new data will be created every second for every human being on the planet by 2020 (http://www.whizpr.be/upload/medialab/21/company/Media_Presentation_2012_DigiUniverseFINAL1.pdf)
• 58% compound annual growth rate forecast for the Hadoop market, surpassing $1 billion by 2020 (http://www.ap-institute.com/big-data-articles/big-data-what-is-hadoop-%E2%80%93-an-explanation-for-absolutely-anyone.aspx; http://www.marketanalysis.com/?p=279)
• <0.5% of all data is ever analyzed and used at the moment (http://www.technologyreview.com/news/514346/the-data-made-me-do-it/)
4. Big Data is for everyone
The market for Big Data technologies is growing more than six times faster than the information technology market as a whole, and the companies that use their data well win.
5. Why AWS for Big Data?
• Immediately available
• Broad and deep capabilities
• Trusted and secure
• Scalable
6. Collect, Store, Analyze, and Visualize
It's easy to get data to AWS, store it securely, and analyze it with the engine of your choice, without any long-term commitment or vendor lock-in.
• Collect: AWS Import/Export, Snowball, Direct Connect, VM Import/Export
• Store: Amazon S3, EMR, Amazon Glacier, Amazon Redshift, DynamoDB
• Analyze: Amazon Kinesis, Lambda, EMR, EC2, Aurora
7. AWS provides the most complete platform for Big Data
What can you do with Big Data on AWS? Big Data repositories, clickstream analysis, ETL offload, machine learning, online ad serving, and BI applications.
11. Big Data deployments are difficult
Where Big Data falls short:
• Rigid and inflexible infrastructure
• Non-adaptive software services
• Highly specialized systems
• Difficult to build and operate
12. Big Data deployments are difficult
Qubole Confidential
Where Big Data falls short:
• 6-18 months to implement
• 27% succeed
• 13% achieve full-scale production
• 57% cite a skills gap as a major inhibitor
Sources:
https://www.capgemini-consulting.com/resource-file-access/resource/pdf/cracking_the_data_conundrum-big_data_pov_13-1-15_v2.pdf
http://www.gartner.com/newsroom/id/3051717
22. Data Admins - use Qubole's built-in Ganglia monitoring
23. Scalability on the Cloud
The Qubole advantage: provisioning, management, and autoscaling
24. On-premises HDFS cluster
• Compute and storage live together
• Compute and storage scale together
• Provisioned for peak capacity
• Cluster must be persistently on
[Diagram: grid of nodes, each combining compute and storage (C+S)]
25. Compute and storage separated on the Cloud
[Diagram: two grids of compute-only nodes (C), both backed by Amazon S3 for storage]
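The sizing consequence of this separation can be shown with a little arithmetic. The sketch below uses purely illustrative numbers (200 TB of data, 8 TB usable per node), not Qubole or AWS figures: a coupled HDFS fleet must satisfy both peak compute and total storage, while a cloud cluster backed by Amazon S3 is sized for compute alone.

```python
# Sketch: why separating compute from storage shrinks clusters.
# All numbers are illustrative assumptions, not Qubole or AWS figures.
import math

def coupled_nodes(peak_compute_nodes: int, data_tb: float, tb_per_node: float) -> int:
    """On-prem HDFS: one fleet must satisfy BOTH peak compute and total storage."""
    storage_nodes = math.ceil(data_tb / tb_per_node)
    return max(peak_compute_nodes, storage_nodes)

def decoupled_nodes(current_compute_nodes: int) -> int:
    """Cloud: data lives in Amazon S3, so the cluster is sized for compute only."""
    return current_compute_nodes

# 200 TB of data, 8 TB usable per node, but only 10 nodes of compute needed now:
print(coupled_nodes(10, 200, 8))   # 25 nodes, persistently on
print(decoupled_nodes(10))         # 10 nodes, and none when idle
```

The coupled cluster also cannot be turned off without losing its storage, which is the slide's "persistently on" point.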
26. Take advantage of the scale of the Cloud
Unlimited compute capacity
[Diagram: downscaling begins at 3:30 p.m.; the cluster reaches its minimum size by 7:00 p.m., then auto-scales back up when batch jobs start]
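The sizing decision behind that curve can be sketched as a simple policy: cover the pending work, clamped between a minimum and maximum cluster size. The policy and parameters below are illustrative assumptions, not Qubole's actual autoscaling algorithm.

```python
# Sketch of an autoscaler's sizing decision, in the spirit of the slide above.
# Policy and parameters are illustrative, not Qubole's algorithm.

def target_cluster_size(pending_tasks: int, tasks_per_node: int,
                        min_nodes: int, max_nodes: int) -> int:
    """Scale to cover the pending work, clamped to [min_nodes, max_nodes]."""
    needed = -(-pending_tasks // tasks_per_node)  # ceiling division
    return max(min_nodes, min(needed, max_nodes))

# Afternoon lull: few tasks, so the cluster drains down to its minimum size.
print(target_cluster_size(pending_tasks=5, tasks_per_node=8, min_nodes=2, max_nodes=50))    # 2
# Evening batch jobs arrive: the cluster scales back up.
print(target_cluster_size(pending_tasks=320, tasks_per_node=8, min_nodes=2, max_nodes=50))  # 40
```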
27. Take advantage of the scale of the Cloud
Instance type flexibility:
• 40+ instance types
• Integration with AWS Reserved Instances
• 37 different instance types used
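Spot Instances are where this flexibility pays off financially. The comparison below is a sketch with hypothetical prices and a hypothetical 70% Spot discount (actual EC2 Spot prices vary by instance type, region, and time): running most of a fleet on Spot can cut the hourly bill by more than half.

```python
# Illustrative cost comparison for a mixed On-Demand / Spot fleet.
# Prices and the discount are hypothetical, not current EC2 rates.

def hourly_cost(nodes: int, on_demand_price: float,
                spot_fraction: float, spot_discount: float) -> float:
    """Hourly cost of a fleet where spot_fraction of nodes run as Spot Instances."""
    spot_nodes = nodes * spot_fraction
    od_nodes = nodes - spot_nodes
    spot_price = on_demand_price * (1 - spot_discount)
    return od_nodes * on_demand_price + spot_nodes * spot_price

all_on_demand = hourly_cost(40, 0.50, spot_fraction=0.0, spot_discount=0.7)
mostly_spot   = hourly_cost(40, 0.50, spot_fraction=0.8, spot_discount=0.7)
print(all_on_demand)  # 20.0
print(mostly_spot)    # 8.8
```

In practice the master and a core of the cluster typically stay On-Demand so that Spot interruptions cannot take the whole cluster down.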
28. On-premises to the Cloud
Qubole’s Hadoop Migration Service
29. Cloud migration use cases
Migrate workloads to the cloud
Works with any on-premises Hadoop distro; data consistency and unified data visibility between on-premises and cloud.
Pain: maxed-out on-prem cluster
Requirements: data in sync during migration; decommissioning of the on-prem workload; 24x7 data replication with no data loss; no downtime
30. Cloud migration use cases
Migrate workloads to the cloud (same pains and requirements as above)
Solution: data moves to the cloud; apps and data pipelines run on QDS
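The "24x7 replication, no data loss" requirement boils down to continuously detecting what differs between the on-prem tree and its cloud copy. The sketch below shows that delta detection over checksum manifests; it is illustrative only, not Qubole's migration service, and the paths and checksums are made up.

```python
# Sketch of the delta detection a continuous replication job performs:
# compare checksum manifests of the on-prem tree and the cloud copy,
# then copy only new or changed files. Illustrative, not Qubole's service.

def replication_delta(source: dict, target: dict) -> list:
    """Return paths that must be (re)copied so target matches source."""
    return sorted(path for path, checksum in source.items()
                  if target.get(path) != checksum)

on_prem = {"/logs/01.gz": "aa1", "/logs/02.gz": "bb2", "/logs/03.gz": "cc3"}
cloud   = {"/logs/01.gz": "aa1", "/logs/02.gz": "stale"}
print(replication_delta(on_prem, cloud))  # ['/logs/02.gz', '/logs/03.gz']
```

Run on a schedule (or driven by filesystem events), an empty delta is the signal that it is safe to cut over and decommission the on-prem workload.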
31. Cloud migration use cases
Workload burst-out to the cloud
Works with any on-premises Hadoop distro; data consistency and unified data visibility between on-premises and cloud.
Pain: workload spikes can't be processed on-prem
Requirements: 24x7 data replication with no data loss; no downtime; bi-directional replication
32. Cloud migration use cases
Workload burst-out to the cloud (same pains and requirements as above)
Solution: on-prem data is synced to the cloud, spiked workloads run on QDS, and results are replicated back on-prem
33. Cloud migration use cases
Move test/dev environments to the Cloud
Works with any on-prem Hadoop distro; data consistency and unified data visibility between on-prem and Cloud.
Pain: production and development share one cluster, limiting development
Requirements: periodic replication; no data loss; no downtime
34. Cloud migration use cases
Move test/dev environments to the Cloud (same pains and requirements as above)
Solution: a subset of the data is replicated to the cloud and dev apps/data pipelines run on QDS, freeing on-prem resources for production
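Because test/dev rarely needs the full production dataset, the periodic replication here typically copies only a subset, such as recent partitions of the tables under test. The sketch below illustrates one such selection rule; the `/warehouse/<table>/dt=YYYY-MM-DD` layout and all paths are assumptions for the example, not a real environment.

```python
# Sketch: pick the data subset a test/dev environment needs, e.g. only
# recent partitions of one table. Layout and paths are illustrative
# assumptions (/warehouse/<table>/dt=YYYY-MM-DD), not a real environment.

def dev_subset(paths: list, table: str, since: str) -> list:
    """Keep only partitions of `table` dated on or after `since`."""
    prefix = f"/warehouse/{table}/dt="
    return [p for p in paths
            if p.startswith(prefix) and p[len(prefix):len(prefix) + 10] >= since]

paths = ["/warehouse/events/dt=2016-05-01/part-0",
         "/warehouse/events/dt=2016-06-01/part-0",
         "/warehouse/users/dt=2016-06-01/part-0"]
print(dev_subset(paths, "events", since="2016-06-01"))
# ['/warehouse/events/dt=2016-06-01/part-0']
```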
38. Strength in numbers
Qubole case study:
Each record is a financial transaction.
• 180B impression opportunities a day
• 3+M peak qps
• 3+TB of data a day (compressed)
40. The solution – Qubole
Qubole case study:
"We needed something that was reliable and easy to learn, set up, use, and put into production without the risk and high expectations that come with committing millions of dollars in upfront investment. Qubole was that thing."
– Marc Rosen, Sr. Director, Data Analytics
41. Qubole at MediaMath
Qubole case study:
• Analytics: Spark/Hive (with Amazon Redshift connector)
• Product: Hive
• Engineering: Spark/Hive
• Business Analysts: SmartQuery
• Data Science: Spark (Scala)
42. Qubole case study:
Don't have to worry about this anymore!