SlideShare a Scribd company logo
1 of 38
Download to read offline
Big Data in the Cloud
State of the Union and Future Trends
Ashish Thusoo
Qubole CEO & Co-Founder
00Copyright 2017 © Qubole
About Me
Today - CEO & Co-founder of Qubole
➢ Big Data Platform built for the Cloud
• Scaling to process 700+ PB of data a month in popular Clouds (AWS, Microsoft Azure, Oracle)
• Provides Auto-Orchestration and Self-Service for big data jobs (e.g. ETL, Machine Learning, Adhoc)
• Cloud and workload optimized Spark, Hive, Hadoop and Presto Engines
Background
➢ Started career at Oracle as a Principal Architect
➢ Ran Data Infrastructure team at Facebook from 2007-2011:
• Built out the Self-service Big Data Platform at Facebook for internal operations
• Saw huge growth, while chartered to provide all teams unified analytics (Marketing, Analytics,
Engineering, Sales Finance, etc.)
• Spawned developments of big data engines such as Apache Hive and Apache Presto
➢ Co-created and led Apache Hive project
From Data Warehouse to Data Lake
Evolution of Big Data
00Copyright 2017 © Qubole
Changing Nature of Data
00Copyright 2017 © Qubole
Changing Nature of Analytics
Descriptive
Analytics
Diagnostic
Analytic
Predictive
Analytics
Prescriptive
Analytics
What
happened?
Why did it
happen?
What will
happen?
How can we
make it happen?
DIFFICULTY
VALUE
Analytics Value Escalator
00Copyright 2017 © Qubole
Breakdown of Data Warehouse – Emergence of Data Lake
ENTERPRISE DATA
WAREHOUSE
ETL
BI
Data
Mining
DATA LAKE
INGEST
ELT ML
Data
Discovery
Data
Warehouse
LEGACY DATA PLATFORM MODEREN DATA PLATFORM
00Copyright 2017 © Qubole
Differences Between Data Lakes and Data Warehouses
DATA LAKE vs DATA WAREHOUSE
Semi-structured / unstructured /
structured / raw DATA
Structured data
SQL / Machine Learning / ETL /
Graph Analytics etc. ANALYTICS FLEXIBILITY
SQL
Cheap storage for large volumes
of data VOLUME
Expensive at large volumes of
data
High agility with ability to quickly
reconfigure for new workloads AGILITY
Fixed configuration and limited
agility
Data Engineers / Data Scientists /
Analysts USERS
Analysts / Business Users
00Copyright 2017 © Qubole
Data Back Office with a Data Warehouse Centric Approach
Marketing
Ops
DeveloperSales
Ops
Product
Managers
Data Infrastructure
Data Team
Operations
Analyst
AnalystData
Architect
Business
Users
Product
Support
Customer
Support
00Copyright 2017 © Qubole
Data Back Office Transformation with a Data Lake
Operations
Analyst
Marketing Ops
Analyst
Data
Architect
Business
Users
Product
Support
Customer
Support
Developer
Sales Ops
Product
Managers
Data
Infrastructure
State of Data Lake Adoption in Enterprises
00Copyright 2017 © Qubole
The Five Stages of Data Lake Maturity
01
Stage
Aspiration
- Production
Reporting/DW
- Researching
02
Stage
Experimentation
- Initial Big Data
Deployment
- Targeted Use
Case
03
Stage
Expansion
- Multiple
Departments
- Multiple Engines
- Top Down Use
Cases
04
Stage
Inversion
- Enterprise
Transformation
- Bottoms up
use cases
-
05
Stage
Nirvana
- Digital
Enterprise
- Ubiquitous
Insights
- True Business
Transformation
00Copyright 2017 © Qubole
Is there a plan in place
to move to a big data
self-service analytics
model?
Yes
65%
No
35%
Data Lake’s Reality Gap: Everyone wants self-service nirvana
65% Moving to Self-Service Model to Enable Data Professionals
00Copyright 2017 © Qubole
How confident are you
that the data team can
achieve self-service
analytics?
16% 39% 32% 11% 2%
0% 20% 40% 60% 80% 100%
Extremely confident
Very confident
Confident
Somewhat doubtful
Extremely doubtful
Data Lake’s Reality Gap: IT is confident they can get there
87% Confident They Can Provide Self-Service Analytics
00Copyright 2017 © Qubole
Data Lake’s Reality Gap: Only 8% are there today
How do you assess
your big data
maturity?
In process but still
have a lot to learn
35%
Need to scale and
optimize our
processes
28%
Still figuring it out
22%
Fully mature
process providing
tools and access
to insights for all
stakeholders
8%
We have proven
processes but still
can’t meet
business demand
7%
Only 8% Have Mature Big Data Processes
00Copyright 2017 © Qubole
A Prescription to Success - Move to the Cloud
Adaptability
• Best machine configuration for the workload
• Best Engine for the workload
• On demand and elastic; automatically scale up
or down
Agility
• Initial provisioning in min/hours, not months
• Change configurations dynamically
• Compute and Storage scale independently
Cost
• Pay only for what you actually use
• Use spot instances to reduce cost by up to 80%
Cloud Computing and Data Lakes
Infrastructure as a Service
00Copyright 2017 © Qubole
Changing Nature of the IT Infrastructure
Mainframe
Computing
Client-Server
Computing
Cloud
Computing
1960 1980 2000
2020
00Copyright 2017 © Qubole
Cloud vs Data Centers
• Infrastructure is an API – Therefore Infrastructure can with the Application
App
STATIC & RIGID
FLEXIBLE & ELASTIC
App
00Copyright 2017 © Qubole
Properties of Data Lakes
Big Data Analysis Is
Bursty
e.g. at Qubole we see
on an average the
minimum to maximum
size of infrastructure
to vary by 3400%
Ever
Expanding
e.g. data
processed on
Qubole as grown
2.5x in a year
Rapidly
Evolving
e.g. Spark,
Presto and
others
addressing gaps
in technology
00Copyright 2017 © Qubole
Creating Data Lakes in the Cloud
+
1. Faster Time to Production (Minutes and Hours instead of 6-18 months)
2. Increasing success in Big Data and Data Science projects
3. Agility and Flexibility of Infrastructure to change compute for workloads and
technology
4. Significant Cost Reduction (Infrastructure fits avg utilization as opposed to peak)
00Copyright 2017 © Qubole
Key Architecture – Separation of Compute and Storage
COMPUTE AND STORAGE COMPUTE AND STORAGE COMPUTE AND STORAGE
COMPUTE AND STORAGE COMPUTE AND STORAGE COMPUTE AND STORAGE
OBJECT STORAGE
ELASTIC COMPUTE
LEGACY DATA CENTER ARCHITECTURE CLOUD INFRASTRUCTURE ARCHITECTURE
00Copyright 2017 © Qubole
Architecture – Putting Together Cloud Data Lakes
00Copyright 2017 © Qubole
Cloud Data Lakes vs Data Center Data Lakes - Automation
Cluster Lifecycle Management
Auto start/terminate
Auto-scaling up/down
Performance Optimization
Cluster re-balancing
Performance/Caching
Cost Optimization
Spot node usage
Resource substitution
00Copyright 2017 © Qubole
Cloud Data Lakes vs Data Center Data Lakes - TCO
25
5
10
15
20
Clustersize(Nodes)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Time
Minimum nodes
Maximum nodes
On Prem Cluster (fixed)
Actual Compute Usage
Cluster Lifecycle Savings (19% / 90%)
Workload Aware Auto-Scaling Savings (48% / 90%)
Spot Shopper Savings (15% / 45%)
00Copyright 2017 © Qubole
Cloud Data Lakes vs Data Center Data Lakes – Concurrency and
Elasticity
select t.county, count(1) from (select
transform(a.zip) using ‘geo.py’ as a.county
from SMALL_TABLE a) tgroup by t.county;
insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on
(a.id=b.ad_id) group by a.id, a.zip;
OBJECT STORAGE
ELASTIC COMPUTE
00Copyright 2017 © Qubole
Bringing it Together
Using the Cloud for Big Data Platforms and Data Science Operations is
Fundamentally Different from Operating Big Data Platforms On-Premise.
Done Properly This Leads to
1. Faster Time to Value
2. Increased Flexibility and Scale
3. Better Adoption of Analytics
4. Ability to Spend Smarter, and Not Less
Example Cases of Success
00Copyright 2017 © Qubole
Case Studies
Reduced Time to Market for New Products - Ibotta, Inc.
➔ Ibotta was faced with the challenges and limitations of a cloud Data Warehouse, reaching constraints
of being able to scale and leaving new data on the floor, which was prevent new features such as
personalized app experiences and campaign optimizations. This drove building a datalake, being the
source of ETL pipelines from ingestion, along with enabling Data Science and Product teams for
feature engineering.
Bursty ETL on Telemetry & User Data -
➔ Lyft crunches through all the data that an application that their drivers and riders create. Lyft
generates 3-5x more data on weekends vs. weekdays. Their data platform has to be flexible enough
to process a significant amount more on weekends in order to deliver the right insights back to their
users.
00Copyright 2017 © Qubole
Case Studies
Data Governance for Customer 360 -
➔ Turner had a initiative for giving Engineering, Data Science and Analyst breaking down their data
silos in order to provide a 360 view of their engagement from campaigns through viewership. Using
QDS, they could easily set up governance and permissions in S3, eliminating the risk of exposing
sensitive data to the wrong teams, while enabling them the right access for their tools.
Data Science Development Workbench -
➔ Acxiom had a need to share data across different data science teams with the control to lock down
certain users. With Qubole, they have built a “Data Science Playground” that allows various DS
teams have a single environment for developing and testing models and algorithms before rolling
them into production.
00Copyright 2017 © Qubole
Right Tool for the Right Team
Built for Anyone Who Uses
Data
Analysts
Data Scientists
Data Engineers
Data Admins
BI teams
Executive Dashboards
Products
A Single Platform for Any
Use Case
Reporting
Ad Hoc Queries
Machine Learning
Stream Processing
Vertical Apps
ETL Workflows
Open Source Engines,
Optimized for the Cloud
Cloud-Native, Optimized
for Agnostics
Web App User Interface, SDK (Python, Java, R, Ruby APIs), ODBC/JDBC connectors (visualization software and databases)
Workflows orchestration (Airflow, Oozie, Azkaban)
Future Directions
00Copyright 2017 © Qubole
(Re)Emergence of Deep Learning
00Copyright 2017 © Qubole
Deep Learning Applications
Applications today focused in areas around
• Image Recognition and Processing
• Speech Recognition and Processing
• NLP and Text Analysis
00Copyright 2017 © Qubole
Emergence of New Use Cases and Technology
Deep Learning Platforms are Emerging
00Copyright 2017 © Qubole
Serverless Computing
00Copyright 2017 © Qubole
Advantages of Serverless
• Zero Administration
• Fast Bursting
• Cost Advantages
00Copyright 2017 © Qubole
Emergence of Serverless Computing
Let’s Chat…
athusoo@qubole.com
Learn More at www.qubole.com
@qubole

More Related Content

What's hot

How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsHow to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsInformatica
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureMark Kromer
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight OverviewLam Le
 
Real-time Data Pipelines with SAP and Apache Kafka
Real-time Data Pipelines with SAP and Apache KafkaReal-time Data Pipelines with SAP and Apache Kafka
Real-time Data Pipelines with SAP and Apache KafkaCarole Gunst
 
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformBig Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformCaserta
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big DataDataWorks Summit
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake ArchitectureDATAVERSITY
 
2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data IntegrationJeffrey T. Pollock
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureDmitry Anoshin
 
Module 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSModule 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSLam Le
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at WalgreensDataWorks Summit
 
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"DataConf
 
Hadoop Data Lake vs classical Data Warehouse: How to utilize best of both wor...
Hadoop Data Lake vs classical Data Warehouse: How to utilize best of both wor...Hadoop Data Lake vs classical Data Warehouse: How to utilize best of both wor...
Hadoop Data Lake vs classical Data Warehouse: How to utilize best of both wor...Kolja Manuel Rödel
 
Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesDataWorks Summit
 
Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Jeffrey T. Pollock
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...Mark Rittman
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 

What's hot (20)

How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsHow to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight Overview
 
Real-time Data Pipelines with SAP and Apache Kafka
Real-time Data Pipelines with SAP and Apache KafkaReal-time Data Pipelines with SAP and Apache Kafka
Real-time Data Pipelines with SAP and Apache Kafka
 
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformBig Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big Data
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
 
2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Data-In-Motion Unleashed
Data-In-Motion UnleashedData-In-Motion Unleashed
Data-In-Motion Unleashed
 
Module 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSModule 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWS
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
 
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
 
Hadoop Data Lake vs classical Data Warehouse: How to utilize best of both wor...
Hadoop Data Lake vs classical Data Warehouse: How to utilize best of both wor...Hadoop Data Lake vs classical Data Warehouse: How to utilize best of both wor...
Hadoop Data Lake vs classical Data Warehouse: How to utilize best of both wor...
 
Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management Challenges
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 

Viewers also liked

The Future of Data
The Future of DataThe Future of Data
The Future of Datablynnbuckley
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Hortonworks
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 
Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Olga Zinkevych
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...Lucas Jellema
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopDavid Yahalom
 

Viewers also liked (7)

The Future of Data
The Future of DataThe Future of Data
The Future of Data
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera Hadoop
 

Similar to Top Trends in Building Data Lakes for Machine Learning and AI

TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and QuboleAmazon Web Services
 
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and QuboleAmazon Web Services
 
State of enterprise data science
State of enterprise data scienceState of enterprise data science
State of enterprise data scienceYan Xu
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationAbdelkrim Hadjidj
 
Connecta Event: Big Query och dataanalys med Google Cloud Platform
Connecta Event: Big Query och dataanalys med Google Cloud PlatformConnecta Event: Big Query och dataanalys med Google Cloud Platform
Connecta Event: Big Query och dataanalys med Google Cloud PlatformConnectaDigital
 
How Analytics Optimize Migration to Amazon Web Services, Microsoft Azure and ...
How Analytics Optimize Migration to Amazon Web Services, Microsoft Azure and ...How Analytics Optimize Migration to Amazon Web Services, Microsoft Azure and ...
How Analytics Optimize Migration to Amazon Web Services, Microsoft Azure and ...Enterprise Management Associates
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaSkillspeed
 
BIG Data & Hadoop Applications in Logistics
BIG Data & Hadoop Applications in LogisticsBIG Data & Hadoop Applications in Logistics
BIG Data & Hadoop Applications in LogisticsSkillspeed
 
Achieve New Heights with Modern Analytics
Achieve New Heights with Modern AnalyticsAchieve New Heights with Modern Analytics
Achieve New Heights with Modern AnalyticsSense Corp
 
Enterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft AzureEnterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft AzureKhalid Salama
 
Presto Summit 2018 - 10 - Qubole
Presto Summit 2018  - 10 - QubolePresto Summit 2018  - 10 - Qubole
Presto Summit 2018 - 10 - Qubolekbajda
 
Replatform your Teradata to a Next-Gen Cloud Data Platform in Weeks, Not Years
Replatform your Teradata to a Next-Gen Cloud Data Platform in Weeks, Not YearsReplatform your Teradata to a Next-Gen Cloud Data Platform in Weeks, Not Years
Replatform your Teradata to a Next-Gen Cloud Data Platform in Weeks, Not YearsVMware Tanzu
 
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaIs your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaCloudera, Inc.
 
Pivotal Big Data Roadshow
Pivotal Big Data Roadshow Pivotal Big Data Roadshow
Pivotal Big Data Roadshow VMware Tanzu
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesDataWorks Summit
 
OAC Workshop - Detroit 2019
OAC Workshop -  Detroit 2019OAC Workshop -  Detroit 2019
OAC Workshop - Detroit 2019Datavail
 
Architecting Snowflake for High Concurrency and High Performance
Architecting Snowflake for High Concurrency and High PerformanceArchitecting Snowflake for High Concurrency and High Performance
Architecting Snowflake for High Concurrency and High PerformanceSamanthaBerlant
 
Business Data Lake Best Practices
Business Data Lake Best PracticesBusiness Data Lake Best Practices
Business Data Lake Best PracticesCapgemini
 

Similar to Top Trends in Building Data Lakes for Machine Learning and AI (20)

TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 
State of enterprise data science
State of enterprise data scienceState of enterprise data science
State of enterprise data science
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
 
Connecta Event: Big Query och dataanalys med Google Cloud Platform
Connecta Event: Big Query och dataanalys med Google Cloud PlatformConnecta Event: Big Query och dataanalys med Google Cloud Platform
Connecta Event: Big Query och dataanalys med Google Cloud Platform
 
How Analytics Optimize Migration to Amazon Web Services, Microsoft Azure and ...
How Analytics Optimize Migration to Amazon Web Services, Microsoft Azure and ...How Analytics Optimize Migration to Amazon Web Services, Microsoft Azure and ...
How Analytics Optimize Migration to Amazon Web Services, Microsoft Azure and ...
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social Media
 
Modern Thinking área digital MSKM 21/09/2017
Modern Thinking área digital MSKM 21/09/2017Modern Thinking área digital MSKM 21/09/2017
Modern Thinking área digital MSKM 21/09/2017
 
BIG Data & Hadoop Applications in Logistics
BIG Data & Hadoop Applications in LogisticsBIG Data & Hadoop Applications in Logistics
BIG Data & Hadoop Applications in Logistics
 
Achieve New Heights with Modern Analytics
Achieve New Heights with Modern AnalyticsAchieve New Heights with Modern Analytics
Achieve New Heights with Modern Analytics
 
Enterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft AzureEnterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft Azure
 
Presto Summit 2018 - 10 - Qubole
Presto Summit 2018  - 10 - QubolePresto Summit 2018  - 10 - Qubole
Presto Summit 2018 - 10 - Qubole
 
Replatform your Teradata to a Next-Gen Cloud Data Platform in Weeks, Not Years
Replatform your Teradata to a Next-Gen Cloud Data Platform in Weeks, Not YearsReplatform your Teradata to a Next-Gen Cloud Data Platform in Weeks, Not Years
Replatform your Teradata to a Next-Gen Cloud Data Platform in Weeks, Not Years
 
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaIs your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
 
Pivotal Big Data Roadshow
Pivotal Big Data Roadshow Pivotal Big Data Roadshow
Pivotal Big Data Roadshow
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
 
OAC Workshop - Detroit 2019
OAC Workshop -  Detroit 2019OAC Workshop -  Detroit 2019
OAC Workshop - Detroit 2019
 
Architecting Snowflake for High Concurrency and High Performance
Architecting Snowflake for High Concurrency and High PerformanceArchitecting Snowflake for High Concurrency and High Performance
Architecting Snowflake for High Concurrency and High Performance
 
Business Data Lake Best Practices
Business Data Lake Best PracticesBusiness Data Lake Best Practices
Business Data Lake Best Practices
 
Higher ROI-N
Higher ROI-NHigher ROI-N
Higher ROI-N
 

Recently uploaded

5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
Create Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI DesktopCreate Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI DesktopThinkInnovation
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
Cyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis ProjectCyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis Projectdanielbell861
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 

Recently uploaded (13)

5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
Create Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI DesktopCreate Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI Desktop
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
Cyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis ProjectCyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis Project
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 

Top Trends in Building Data Lakes for Machine Learning and AI

  • 1. Big Data in the Cloud State of the Union and Future Trends Ashish Thusoo Qubole CEO & Co-Founder
  • 2. 00Copyright 2017 © Qubole About Me Today - CEO & Co-founder of Qubole ➢ Big Data Platform built for the Cloud • Scaling to process 700+ PB of data a month in popular Clouds (AWS, Microsoft Azure, Oracle) • Provides Auto-Orchestration and Self-Service for big data jobs (e.g. ETL, Machine Learning, Adhoc) • Cloud and workload optimized Spark, Hive, Hadoop and Presto Engines Background ➢ Started career at Oracle as a Principal Architect ➢ Ran Data Infrastructure team at Facebook from 2007-2011: • Built out the Self-service Big Data Platform at Facebook for internal operations • Saw huge growth, while chartered to provide all teams unified analytics (Marketing, Analytics, Engineering, Sales Finance, etc.) • Spawned developments of big data engines such as Apache Hive and Apache Presto ➢ Co-created and led Apache Hive project
  • 3. From Data Warehouse to Data Lake Evolution of Big Data
  • 4. 00Copyright 2017 © Qubole Changing Nature of Data
  • 5. 00Copyright 2017 © Qubole Changing Nature of Analytics Descriptive Analytics Diagnostic Analytic Predictive Analytics Prescriptive Analytics What happened? Why did it happen? What will happen? How can we make it happen? DIFFICULTY VALUE Analytics Value Escalator
  • 6. 00Copyright 2017 © Qubole Breakdown of Data Warehouse – Emergence of Data Lake ENTERPRISE DATA WAREHOUSE ETL BI Data Mining DATA LAKE INGEST ELT ML Data Discovery Data Warehouse LEGACY DATA PLATFORM MODEREN DATA PLATFORM
  • 7. 00Copyright 2017 © Qubole Differences Between Data Lakes and Data Warehouses DATA LAKE vs DATA WAREHOUSE Semi-structured / unstructured / structured / raw DATA Structured data SQL / Machine Learning / ETL / Graph Analytics etc. ANALYTICS FLEXIBILITY SQL Cheap storage for large volumes of data VOLUME Expensive at large volumes of data High agility with ability to quickly reconfigure for new workloads AGILITY Fixed configuration and limited agility Data Engineers / Data Scientists / Analysts USERS Analysts / Business Users
  • 8. 00Copyright 2017 © Qubole Data Back Office with a Data Warehouse Centric Approach Marketing Ops DeveloperSales Ops Product Managers Data Infrastructure Data Team Operations Analyst AnalystData Architect Business Users Product Support Customer Support
  • 9. 00Copyright 2017 © Qubole Data Back Office Transformation with a Data Lake Operations Analyst Marketing Ops Analyst Data Architect Business Users Product Support Customer Support Developer Sales Ops Product Managers Data Infrastructure
  • 10. State of Data Lake Adoption in Enterprises
  • 11. 00Copyright 2017 © Qubole The Five Stages of Data Lake Maturity 01 Stage Aspiration - Production Reporting/DW - Researching 02 Stage Experimentation - Initial Big Data Deployment - Targeted Use Case 03 Stage Expansion - Multiple Departments - Multiple Engines - Top Down Use Cases 04 Stage Inversion - Enterprise Transformation - Bottoms up use cases - 05 Stage Nirvana - Digital Enterprise - Ubiquitous Insights - True Business Transformation
  • 12. 00Copyright 2017 © Qubole Is there a plan in place to move to a big data self-service analytics model? Yes 65% No 35% Data Lake’s Reality Gap: Everyone wants self-service nirvana 65% Moving to Self-Service Model to Enable Data Professionals
  • 13. 00Copyright 2017 © Qubole How confident are you that the data team can achieve self-service analytics? 16% 39% 32% 11% 2% 0% 20% 40% 60% 80% 100% Extremely confident Very confident Confident Somewhat doubtful Extremely doubtful Data Lake’s Reality Gap: IT is confident they can get there 87% Confident They Can Provide Self-Service Analytics
  • 14. 00Copyright 2017 © Qubole Data Lake’s Reality Gap: Only 8% are there today How do you assess your big data maturity? In process but still have a lot to learn 35% Need to scale and optimize our processes 28% Still figuring it out 22% Fully mature process providing tools and access to insights for all stakeholders 8% We have proven processes but still can’t meet business demand 7% Only 8% Have Mature Big Data Processes
  • 15. 00Copyright 2017 © Qubole A Prescription to Success - Move to the Cloud Adaptability • Best machine configuration for the workload • Best Engine for the workload • On demand and elastic; automatically scale up or down Agility • Initial provisioning in min/hours, not months • Change configurations dynamically • Compute and Storage scale independently Cost • Pay only for what you actually use • Use spot instances to reduce cost by up to 80%
  • 16. Cloud Computing and Data Lakes Infrastructure as a Service
  • 17. 00Copyright 2017 © Qubole Changing Nature of the IT Infrastructure Mainframe Computing Client-Server Computing Cloud Computing 1960 1980 2000 2020
  • 18. 00Copyright 2017 © Qubole Cloud vs Data Centers • Infrastructure is an API – Therefore Infrastructure can with the Application App STATIC & RIGID FLEXIBLE & ELASTIC App
  • 19. 00Copyright 2017 © Qubole Properties of Data Lakes Big Data Analysis Is Bursty e.g. at Qubole we see on an average the minimum to maximum size of infrastructure to vary by 3400% Ever Expanding e.g. data processed on Qubole as grown 2.5x in a year Rapidly Evolving e.g. Spark, Presto and others addressing gaps in technology
  • 20. 00Copyright 2017 © Qubole Creating Data Lakes in the Cloud + 1. Faster Time to Production (Minutes and Hours instead of 6-18 months) 2. Increasing success in Big Data and Data Science projects 3. Agility and Flexibility of Infrastructure to change compute for workloads and technology 4. Significant Cost Reduction (Infrastructure fits avg utilization as opposed to peak)
  • 21. 00Copyright 2017 © Qubole Key Architecture – Separation of Compute and Storage COMPUTE AND STORAGE COMPUTE AND STORAGE COMPUTE AND STORAGE COMPUTE AND STORAGE COMPUTE AND STORAGE COMPUTE AND STORAGE OBJECT STORAGE ELASTIC COMPUTE LEGACY DATA CENTER ARCHITECTURE CLOUD INFRASTRUCTURE ARCHITECTURE
  • 22. 00Copyright 2017 © Qubole Architecture – Putting Together Cloud Data Lakes
  • 23. 00Copyright 2017 © Qubole Cloud Data Lakes vs Data Center Data Lakes - Automation Cluster Lifecycle Management Auto start/terminate Auto-scaling up/down Performance Optimization Cluster re-balancing Performance/Caching Cost Optimization Spot node usage Resource substitution
  • 24. 00Copyright 2017 © Qubole Cloud Data Lakes vs Data Center Data Lakes - TCO 25 5 10 15 20 Clustersize(Nodes) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Time Minimum nodes Maximum nodes On Prem Cluster (fixed) Actual Compute Usage Cluster Lifecycle Savings (19% / 90%) Workload Aware Auto-Scaling Savings (48% / 90%) Spot Shopper Savings (15% / 45%)
  • 25. 00Copyright 2017 © Qubole Cloud Data Lakes vs Data Center Data Lakes – Concurrency and Elasticity select t.county, count(1) from (select transform(a.zip) using ‘geo.py’ as a.county from SMALL_TABLE a) tgroup by t.county; insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip; OBJECT STORAGE ELASTIC COMPUTE
  • 26. 00Copyright 2017 © Qubole Bringing it Together Using the Cloud for Big Data Platforms and Data Science Operations is Fundamentally Different from Operating Big Data Platforms On-Premise. Done Properly This Leads to 1. Faster Time to Value 2. Increased Flexibility and Scale 3. Better Adoption of Analytics 4. Ability to Spend Smarter, and Not Less
  • 27. Example Cases of Success
  • 28. 00Copyright 2017 © Qubole Case Studies Reduced Time to Market for New Products - Ibotta, Inc. ➔ Ibotta was faced with the challenges and limitations of a cloud Data Warehouse, reaching constraints of being able to scale and leaving new data on the floor, which was prevent new features such as personalized app experiences and campaign optimizations. This drove building a datalake, being the source of ETL pipelines from ingestion, along with enabling Data Science and Product teams for feature engineering. Bursty ETL on Telemetry & User Data - ➔ Lyft crunches through all the data that an application that their drivers and riders create. Lyft generates 3-5x more data on weekends vs. weekdays. Their data platform has to be flexible enough to process a significant amount more on weekends in order to deliver the right insights back to their users.
  • 29. 00Copyright 2017 © Qubole Case Studies Data Governance for Customer 360 - ➔ Turner had a initiative for giving Engineering, Data Science and Analyst breaking down their data silos in order to provide a 360 view of their engagement from campaigns through viewership. Using QDS, they could easily set up governance and permissions in S3, eliminating the risk of exposing sensitive data to the wrong teams, while enabling them the right access for their tools. Data Science Development Workbench - ➔ Acxiom had a need to share data across different data science teams with the control to lock down certain users. With Qubole, they have built a “Data Science Playground” that allows various DS teams have a single environment for developing and testing models and algorithms before rolling them into production.
  • 30. 00Copyright 2017 © Qubole Right Tool for the Right Team Built for Anyone Who Uses Data Analysts Data Scientists Data Engineers Data Admins BI teams Executive Dashboards Products A Single Platform for Any Use Case Reporting Ad Hoc Queries Machine Learning Stream Processing Vertical Apps ETL Workflows Open Source Engines, Optimized for the Cloud Cloud-Native, Optimized for Agnostics Web App User Interface, SDK (Python, Java, R, Ruby APIs), ODBC/JDBC connectors (visualization software and databases) Workflows orchestration (Airflow, Oozie, Azkaban)
  • 32. 00Copyright 2017 © Qubole (Re)Emergence of Deep Learning
  • 33. 00Copyright 2017 © Qubole Deep Learning Applications Applications today focused in areas around • Image Recognition and Processing • Speech Recognition and Processing • NLP and Text Analysis
  • 34. 00Copyright 2017 © Qubole Emergence of New Use Cases and Technology Deep Learning Platforms are Emerging
  • 35. 00Copyright 2017 © Qubole Serverless Computing
  • 36. 00Copyright 2017 © Qubole Advantages of Serverless • Zero Administration • Fast Bursting • Cost Advantages
  • 37. 00Copyright 2017 © Qubole Emergence of Serverless Computing