Leveraging the cloud for analytics and machine learning 1.29.19
1. © Cloudera, Inc. All rights reserved.
Migrating Analytics and ML to the Cloud
Sushant Rao
Cloud Product Marketing @ Cloudera
Ron Abellera
Azure Global Black Belt @ Microsoft Azure
2.
Poll Question 1: Where are you in your journey to the Cloud?
● Just started researching options in Cloud
● Starting to test different products / services in Cloud
● Have some deployments and looking to expand in Cloud
● Critical mass in the Cloud
3.
Why Cloud?
CLOUD BENEFITS
• Agility
○ Speed of making changes to meet business / technical needs
• Scalable & Elastic
○ Scale up and down quickly
• Reliable
○ Multiple options to ensure infrastructure / services are available
○ Tenant isolation ensures different workloads don’t conflict with each other
• Other
○ Pay-as-you-go charges only for consumption (but not necessarily cheaper)
○ Self-service enables users to do their work without contacting IT / Data platform team
4.
But ...
CLOUD CHALLENGES
• Multiple copies of data & Disjointed services
○ Different services have their own copies and may not work together
• On-premises integration
○ Data gravity is on-prem, so cloud needs to complement current data platform
• Cloud Lock-in
○ Open source prevented lock-in for on-prem. What about cloud?
• Shadow IT
○ Individual business units may set up their own cloud deployments without the architecture, security, and/or governance of the on-prem deployment
• Cheaper?
○ On-prem can be more than 2x cheaper than cloud
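The "Cheaper?" point, together with the earlier note that pay-as-you-go is not necessarily cheaper, comes down to utilization. The sketch below is a minimal break-even calculation. It assumes, hypothetically, that the 2x figure applies to cost per running hour; the dollar amounts and the 730-hour month are invented for illustration and are not from the presentation.

```python
def monthly_cost_on_prem(hourly_cost: float, hours_in_month: float = 730.0) -> float:
    """On-prem capacity is paid for whether or not it is used."""
    return hourly_cost * hours_in_month


def monthly_cost_cloud(hourly_rate: float, utilization: float,
                       hours_in_month: float = 730.0) -> float:
    """Pay-as-you-go: only the hours actually consumed are billed."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return hourly_rate * hours_in_month * utilization


def break_even_utilization(on_prem_hourly: float, cloud_hourly: float) -> float:
    """Utilization below which pay-as-you-go is the cheaper option."""
    return on_prem_hourly / cloud_hourly


# Hypothetical figures: cloud costs 2x on-prem per running hour.
ON_PREM_HOURLY = 10.0
CLOUD_HOURLY = 20.0

threshold = break_even_utilization(ON_PREM_HOURLY, CLOUD_HOURLY)
print(f"break-even utilization: {threshold:.0%}")
for u in (0.25, 0.50, 1.00):
    cloud = monthly_cost_cloud(CLOUD_HOURLY, u)
    on_prem = monthly_cost_on_prem(ON_PREM_HOURLY)
    print(f"utilization {u:.0%}: cloud ${cloud:,.0f} vs on-prem ${on_prem:,.0f}")
```

Under these assumptions, an always-on workload costs twice as much in the cloud, while a workload that runs less than half the time costs less there. That is one way to reconcile the "pay only for consumption" benefit with the "Cheaper?" challenge.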
8.
Common Use Cases for Cloud
CORPORATE DIRECTIVE
• C-level has decided to utilize the cloud more
• Running out of data center space, looking for more agility / flexibility
DISASTER RECOVERY
• Back up all data to the cloud, without a second “physical” location
• Save the time and expense of setting up a physical DR site
ELASTIC WORKLOADS
• Separate environment for new production workloads or for intermittent, ad-hoc workloads
• Takes too long to acquire and set up on-prem infrastructure
SANDBOX
• Environment to test queries and algorithms
• Doesn’t impact production cluster as data analysts and engineers test
9.
Cloudera’s Solution for Data Analytics / Engineering in Cloud
• The modern platform for machine learning and analytics
○ Numerous functions for all types of jobs and queries
• with multiple deployment options
○ On-premises, Public cloud (including multi-), and Hybrid
• and one shared data experience
○ Framework for consistent security, governance, and metadata management across applications and deployments
10.
The Modern Platform for Machine Learning & Analytics
DATA ENGINEERING: DATA PROCESSING
• Cost efficient
• Reliable
• Scalable
• Based on Spark, MapReduce, Hive & Pig
• Supported by Workload Analytics
DATA WAREHOUSE: FAST BI & SQL
• Flexibility
• Elastic scale
• Go beyond SQL
• Based on Impala & Hive
• SQL dev environment
• Supported by Workload Analytics
DATA SCIENCE: MACHINE LEARNING
• Fast dev to production
• Secure self-serve
• Based on Python, R, and Spark
• ML dev environment (CDSW)
OPERATIONAL DATABASE: ONLINE & REAL-TIME
• High throughput, low latency
• Strongly consistent
• Based on HBase, Kudu & Spark Streaming
11.
Cloudera’s Vision for AI and Machine Learning
Modern Enterprise Platform, Tools, and Expert Guidance to help you Unlock Business Value with ML / AI
• Agile platform to build, train, and deploy scalable ML applications
• Enterprise data science tools to accelerate team productivity
• Expert guidance, services & training to fast track value & scale
12.
With Multiple Deployment Options
[Diagram: workloads running on infrastructure services]
• Via Cloudera Altus (IaaS): Operational Database, Data Engineering, Data Warehouse, Data Science
• Via Cloudera Altus Services (PaaS): Data Engineering, Data Warehouse
• Infrastructure: Traditional Infrastructure (combined storage and compute) or Cloud Infrastructure (decoupled storage and compute)
13.
Cloudera Enterprise Data Platform
Benefits for IT infra & ops
• Central control and security
• Focus on curating not firefighting
Benefits for users
• Value from single source of truth
• Bring the best tools for each job
[Diagram: platform layers]
• WORKLOADS: Data Science, Data Warehouse, Operational Database, Data Engineering, 3rd Party Services
• COMMON SERVICES: Security, Governance, Lifecycle Management, Data Catalog
• CONTROL PLANE
• STORAGE: HDFS, Kudu, Public Cloud Object Storage (S3, ADLS, etc), Private Cloud Object Storage
14.
Journey to the Cloud from On-Prem
[Diagram: persistent Cloudera cluster on premises. COMPUTE: Data Engineering, Analytics, Data Science. DATA CONTEXT: Security, Metadata, Governance. STORAGE: HDFS.]
Current State
● Multiple workloads and services run in a single cluster
● Data Context (security, metadata, governance) in single cluster
Goals in Journey to the Cloud
● Get to Cloud with minimal impact and change
● Replicate security groups and permissions in the Cloud
● May require multiple stages to get there
● First step may vary depending on goals
● Need to determine how data will be replicated to the Cloud
15.
Start by Replicating Data to Public Cloud via BDR
[Diagram: BDR replicates HDFS data from the on-premises persistent Cloudera cluster to HDFS in the customer's public cloud (AWS, Azure, GCP, etc). On-premises cluster: COMPUTE (Hive, Impala, Spark), DATA CONTEXT (Sentry, HMS, Navigator), STORAGE (HDFS).]
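BDR is configured through Cloudera Manager rather than in code, so the sketch below is only a rough stand-in for the idea of an HDFS-to-HDFS copy job and is not BDR itself. It builds, without running, a command line for Apache Hadoop DistCp, a separate copy tool. The NameNode host names and the path are hypothetical. The `-pugp` flag preserves user, group, and permission attributes, which loosely corresponds to the goal of carrying permissions over to the cloud.

```python
import shlex
from typing import List


def build_distcp_command(source_uri: str, target_uri: str,
                         preserve_permissions: bool = True) -> List[str]:
    """Build (but do not run) a DistCp invocation that syncs source to target."""
    command = ["hadoop", "distcp", "-update"]  # -update copies only missing or changed files
    if preserve_permissions:
        command.append("-pugp")  # preserve user, group, and permission attributes
    command += [source_uri, target_uri]
    return command


# Hypothetical NameNode addresses and path, for illustration only.
cmd = build_distcp_command(
    "hdfs://onprem-nn.example.com:8020/data/warehouse",
    "hdfs://cloud-nn.example.com:8020/data/warehouse",
)
print(shlex.join(cmd))
```

Returning the command as a list keeps it safe to pass to `subprocess.run` later without shell quoting issues. Scheduling, metadata replication, and monitoring, which a managed replication service would handle, are out of scope here.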
16.
Journey to the Cloud - Step 1
[Diagram: 1 - LIFT AND SHIFT. Persistent Cloudera cluster in the customer cloud. COMPUTE: Data Engineering, Analytics, Data Science. DATA CONTEXT: Security, Metadata, Governance. STORAGE: HDFS (a cloud object store is also shown).]
17.
Journey to the Cloud - Step 2
[Diagram: two panels in the customer cloud, each a persistent Cloudera cluster with COMPUTE (Data Engineering, Analytics, Data Science) and DATA CONTEXT (Security, Metadata, Governance). 1 - LIFT AND SHIFT: storage on HDFS. 2 - OBJECT STORAGE: storage on a cloud object store.]
18.
Journey to the Cloud - Step 3
[Diagram: three panels. 1 - LIFT AND SHIFT and 2 - OBJECT STORAGE: persistent Cloudera clusters in the customer cloud, as in the previous steps. 3 - CLOUD NATIVE ARCHITECTURES: in the customer cloud, transient Cloudera clusters (Altus) provide COMPUTE for Data Engineering and Analytics alongside a persistent Cloudera cluster (Director) with COMPUTE and DATA CONTEXT; STORAGE and DATA CONTEXT sit on the cloud object store; the Cloudera Altus control plane runs in the Cloudera cloud.]
19.
Customer Examples
Many Cloudera customers (Global 5K) used public cloud
• Online retailer
○ Over 2,000 nodes with ~2PB of data in cloud running in an active-active configuration
○ Transforming data with Spark and then analyzing with Apache Hive
• German chain of coffee retailers and cafés
○ 30+ nodes with 50TB of data in cloud
○ Modern Cloudera platform with an Impala data warehouse
• Global information company
○ 70+ nodes in cloud across Microsoft Azure and AWS
○ Replaced Netezza with Hadoop, leveraging both Impala and Spark for analytics
20.
Cloudera is using cloud as well
Security Use Case
Altus-based solution cut costs by more than 50% compared to the initial implementation
21.
Cloudera Altus
Key Differentiators
• Multi-function: Unified platform for data engineering, data warehouse, and data science
• Multi-cloud: Option for on-premises, Public cloud (including multi-), and Hybrid
• SDX: Integrated shared data experience across multi-function clusters
22.
Pick the Right Altus Component for Your Needs
Depending on workload and service level
Altus Data Engineering (PaaS)
• Service offering for batch-oriented Data Engineering jobs on data in object stores (ADLS, others)
• Usage-based pricing
• Runs Apache Spark, Apache Hive and MapReduce jobs
• Provides Workload Analytics to troubleshoot and optimize job performance
Altus Data Warehouse (PaaS)
• Service offering for cloud-native data warehouse use cases
• Usage-based pricing
• Runs Apache Impala on data stored in object stores (ADLS, others)
• Exposes endpoint to connect BI Tools for visualization
• Offers built-in SQL Editor for ad-hoc data exploration
Altus (IaaS)
• EDH for public cloud which gives customers full cluster control
• Self-managed cloud infrastructure
• Usage- or node-based pricing
• Full breadth of CDH services available (Apache Kafka, Apache Spark Streaming, CDSW, etc)
• Supports deployments on 5 public cloud platforms
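The selection guidance on this slide can be read as a small decision rule. The sketch below is a hypothetical helper, not a Cloudera API. It maps a workload's engine to one of the three offerings using only the engines the slide lists; the function name, enum, and the `needs_full_cluster_control` flag are invented for illustration.

```python
from enum import Enum


class AltusComponent(Enum):
    DATA_ENGINEERING = "Altus Data Engineering (PaaS)"
    DATA_WAREHOUSE = "Altus Data Warehouse (PaaS)"
    IAAS = "Altus (IaaS)"


# Engines each PaaS offering runs, as listed on the slide.
DATA_ENGINEERING_ENGINES = {"spark", "hive", "mapreduce"}
DATA_WAREHOUSE_ENGINES = {"impala"}


def pick_altus_component(engine: str,
                         needs_full_cluster_control: bool = False) -> AltusComponent:
    """Hypothetical helper: map a workload's engine to an Altus offering."""
    if needs_full_cluster_control:
        return AltusComponent.IAAS
    key = engine.strip().lower()
    if key in DATA_ENGINEERING_ENGINES:
        return AltusComponent.DATA_ENGINEERING
    if key in DATA_WAREHOUSE_ENGINES:
        return AltusComponent.DATA_WAREHOUSE
    # Anything else (Kafka, Spark Streaming, CDSW, ...) falls to the full CDH option.
    return AltusComponent.IAAS


for engine in ("Spark", "Impala", "Kafka"):
    print(engine, "->", pick_altus_component(engine).value)
```

A real choice would also weigh service level and pricing model, which the slide mentions but this sketch leaves out.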
24. Cosmos
Microsoft’s internal data lake
• A data lake for all teams @Microsoft
• Tools approachable by any developer
• Batch, Interactive, Streaming, ML
• Used across Office, Xbox, Azure, Windows, Ads, Bing, Skype, …
By the numbers
• Exabytes of data
• 100Ks of Physical Servers
• Millions of Interactive Queries
• Huge Streaming Pipelines
• 100Ks of Batch Jobs
• 10K+ Developers
Microsoft’s Big Data Service
Azure Data Lake
A data lake for everyone
• The next version of Cosmos
• Fully aligned with Hadoop ecosystem and standards, with full support for Hadoop tools and engines as well as unique Microsoft capabilities
• Migration from Cosmos to ADL is already underway
• External customers on the same service as internal customers
25. Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop
[Diagram labels: Devices; Interactive queries; Batch queries; Machine Learning; Data warehouse; Real-time analytics]
26. Azure Data Lake Overview
[Diagram: Azure Data Lake architecture. Labels: Windows Azure Blob Storage; Spark; MapReduce; Impala; Cloudera; Azure Key Vault; Azure Active Directory; Azure Data Lake Store in-cluster services; U-SQL; ADL Analytics; Ingestion Service; ADLS Gateway Service; Cosmos API; HDFS++ API; Scope; YARN; ADLS Micro Services; ADL local tier; Azure VMs; Azure remote storage tier]
27. ADLS Gen 2
• Preview announced June 2018
• Allows all storage regions to have HDFS API
• Soon available for Cloudera implementations
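In practice, the "HDFS API" point means Hadoop tools address the store through filesystem URIs instead of a separate SDK. The sketch below shows the URI shapes used by Hadoop's ABFS connector for ADLS Gen2 and by the `adl` connector for ADLS Gen1. The account, container, and path names are hypothetical, and nothing here implies which Cloudera release supports either connector.

```python
def adls_gen2_uri(account: str, container: str, path: str = "", secure: bool = True) -> str:
    """Build an ABFS URI: abfs[s]://<container>@<account>.dfs.core.windows.net/<path>."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def adls_gen1_uri(account: str, path: str = "") -> str:
    """Build an ADLS Gen1 URI: adl://<account>.azuredatalakestore.net/<path>."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


# Hypothetical account, container, and path names.
print(adls_gen2_uri("examplelake", "analytics", "/raw/events"))
print(adls_gen1_uri("examplelake", "/raw/events"))
```

A URI like these could then be passed wherever an `hdfs://` path is accepted, provided the cluster has the matching connector and credentials configured.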
30.
Poll Question 2: How do you want to use the Cloud?
• Migrating existing workloads from your on-prem cluster to Azure
• Deploying new data analytics / engineering jobs in Cloud (PaaS / SaaS)
• Interested in both of the above
• Not sure
31.
Cloud Data Analytics / Engineering with Cloudera
ADVANTAGES
• Same solution as on-premises and multi-cloud
• Eliminate data copies
• Single security framework with universally shared metadata
• Easy to track data lineage
• Unified services
BUSINESS VALUE
• Lower risk of data breach
• Analysts more productive on jobs
• Self-service (no shadow IT) and more productive
• IT more strategic, less admin time
• Deployment choices and no lock-in
32.
Ready to try the Cloud?
$10K of free Azure credits!
• Cloudera and Microsoft will offer $10,000 in FREE Azure credits for qualifying opportunities
• To be applied to Azure subscription
• Must be consumed in 60 days
• Must be a Cloudera product running on Microsoft Azure
• Must be tied to a single customer entity for PoC or pilot deployment
• Limited time offer
• Contact azureoffer@cloudera.com
35.
Cloudera Pricing / Acquisition
Acquisition Options
● Pay-as-you-go usage-based pricing
● Node-based license subscription
● Free 30-day trial
● Pre-pay of cloud credits
● Free version that can be deployed in the cloud
Pricing - https://www.cloudera.com/products/pricing.html