EMERGING TRENDS IN DATA ANALYTICS
Wei-Chiu Chuang, Ph.D.
HDFS Lead Engineer, Cloudera | Apache Hadoop PMC/Committer
© 2019 Cloudera, Inc. All rights reserved. 2
Hadoop HDFS Lead Engineer
Committer/PMC
Founding member
“BIG DATA” IS PASSÉ
https://trends.google.com/trends/explore?date=today%205-y&q=Big%20Data,Data%20Analytics
LANDSCAPE OF COMMERCIAL OPEN SOURCE DATA ANALYTICS SOFTWARE
A YEAR OF TECTONIC SHIFT
Merged
Acquired by
Acquired by
OPEN SOURCE DATA ANALYTICS SOFTWARE UNICORNS
SUB-BILLION, UNICORNS-TO-BE
[Bar chart: valuation in units of $100 M for Lucidworks, neo4j and H2O.AI; data labels 3.7, 5 and 4]
DATA ENGINEER COMMUNITY
IS HADOOP DEAD?
MOST ACTIVE VISITS AND DOWNLOADS
• Hadoop web pages are the most popular among all Apache projects.
Apache Software Foundation Annual Report 2019
MAPREDUCE IS DEAD; LONG LIVE HDFS AND YARN
• Stack Overflow Trends
• HDFS and YARN are mature
• Hadoop is the on-prem platform for Big Data Analytics.
• Like Linux. Boring, but it’s the foundation.
COMPUTE ENGINES IN HADOOP ECOSYSTEM
• Stack Overflow Trends
• Spark most popular
• Hive stable
• MapReduce, Pig and Storm are dead.
https://insights.stackoverflow.com/trends?tags=apache-spark%2Chadoop%2Chive%2Cmapreduce%2Capache-pig%2Capache-storm&fbclid=IwAR1InUJDJPPoUDDfhYtiiaxo21RVqKO2y-SwDQ2_AT4kpIvBn1NjuLYhdGc
BIG DATA: YOUR NAME IS SQL
• Stack Overflow Trends
• Hive was the most popular until 2018.
• SparkSQL grew fastest until 2018.
• Cloud: BigQuery is more popular than Redshift
BATCH VS REAL-TIME
KAFKA
• Message broker >> stream processing
STREAM PROCESSING
• Stack Overflow Trends
⎯ (excluding Kafka Streams)
• Flink grows fastest; Beam too
• Spark Streaming declining
• Storm is dead
https://insights.stackoverflow.com/trends?tags=spark-streaming%2Capache-flink%2Capache-storm%2Capache-beam&fbclid=IwAR1InUJDJPPoUDDfhYtiiaxo21RVqKO2y-SwDQ2_AT4kpIvBn1NjuLYhdGc
SPARK
• Stack Overflow Trends
• Spark is no longer the cool kid
• You will write in PySpark or SparkSQL.
• Spark Streaming is declining
• Very few people develop ML with Spark.
⎯ Wait for Spark 3.0?
• batch >> streaming
https://insights.stackoverflow.com/trends?tags=apache-spark%2Cpyspark%2Cspark-streaming%2Cspark-dataframe%2Cpyspark-sql%2Capache-spark-sql%2Capache-spark-mllib&fbclid=IwAR1InUJDJPPoUDDfhYtiiaxo21RVqKO2y-SwDQ2_AT4kpIvBn1NjuLYhdGc
LANGUAGE
• Python > Java > C++ > Go
https://insights.stackoverflow.com/trends?tags=python%2Cgo%2Cjava%2Cc%2B%2B
TRENDY TECHNOLOGIES
• Stack Overflow Trends
https://insights.stackoverflow.com/trends?tags=tensorflow%2Ckubernetes%2Capache-spark%2Capache-kafka%2Cdocker&fbclid=IwAR1InUJDJPPoUDDfhYtiiaxo21RVqKO2y-SwDQ2_AT4kpIvBn1NjuLYhdGc
SUMMARY
• Deep learning (TensorFlow)
• Microservices (Kubernetes, Docker, Kafka)
• Batch > Streaming, but Streaming is gaining traction
• Python
DEVELOPER COMMUNITY
APACHE SOFTWARE FOUNDATION
20th anniversary
300+ projects
48 incubating projects
FY2019:
187 k commits
3215 committers
CNCF
• 6 graduated projects
• 16 incubating projects
• 18 sandbox projects
• Last year
⎯ 141 k commits
⎯ 7647 committers
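As a rough sketch, the commit and committer counts quoted on the ASF and CNCF slides can be turned into an average commits-per-committer figure. The numbers are the ones on the slides; the ratio itself and the function name are assumptions of this sketch, not something the talk reports, and the two foundations may count committers and reporting periods differently.

```python
# Figures quoted on the ASF (FY2019) and CNCF (last year) slides, as raw counts.
ASF_COMMITS, ASF_COMMITTERS = 187_000, 3215
CNCF_COMMITS, CNCF_COMMITTERS = 141_000, 7647


def commits_per_committer(commits: int, committers: int) -> float:
    """Average number of commits per committer (hypothetical helper)."""
    return commits / committers


asf_ratio = commits_per_committer(ASF_COMMITS, ASF_COMMITTERS)
cncf_ratio = commits_per_committer(CNCF_COMMITS, CNCF_COMMITTERS)

print(f"ASF:  {asf_ratio:.1f} commits per committer")
print(f"CNCF: {cncf_ratio:.1f} commits per committer")
```

The averages hide how unevenly commits are distributed, so they are only a first-pass comparison.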
APACHE VS CNCF
Apache
• Big Data, Database, Cloud
• Contributors are individuals
CNCF
• DevOps tools
• Contributors are associated with companies
IMPACT OF CLDR-HWX MERGER
• 63% of Hadoop commits in 2019 were made by Cloudera employees.
• Community development
• Bad news: Apache Ambari, Apache Sentry
• Good news: Hive support for Kudu, Ranger support for Impala
DEVELOPER COMMUNITY MOVING TO ASIA
• HBase
⎯ HBaseCon Asia
• Hadoop
⎯ 1st ever Hadoop Meetup in China
• China is the third largest contributor to CNCF projects.
⎯ 3 projects were born in China
[Chart: Apache GitHub visits]
MACHINE LEARNING
● Apache Hadoop Submarine
● Submarine Project.
● Distributed machine learning platform
● Algorithm development, model batch training, model incremental training, model online services and model management
⎯ Available since: 3.2.0 (as part of YARN)
⎯ Became a top-level subproject: 0.2.0 (separate release)
⎯ Lots of new stuff in 0.3.0.
https://hadoop.apache.org/submarine/
MEASURING THE HEALTH OF DEVELOPER COMMUNITY
CREATED/RESOLVED JIRAS
NUMBER OF CONTRIBUTORS
Apache Spark: https://www.openhub.net/p/apache-spark
MongoDB: https://www.openhub.net/p/mongodb
DIVERSITY (AFFILIATION) OF KUBERNETES DEVELOPERS
https://k8s.devstats.cncf.io/d/8/company-statistics-by-repository-group?orgId=1&var-period=m&var-metric=prs&var-repogroup_name=All&var-companies=All
APACHE APEX (2016 - 2018)
The story of an abandoned project
• “Enterprise-grade unified stream and batch processing engine.”
• Founded in April 2016.
• Backed by DataTorrent, which collapsed in May 2018 after raising $23.9 million.
APACHE APEX
• https://reporter.apache.org/wizard/statistics?apex
• Last new PMC member was added on 2018-05-15.
• Last new committer was added on 2017-10-19.
• Community Health Score (Chi): -4.28 (Action required!)
WILL CLOUD KILL OPEN SOURCE?
OPEN SOURCE VS. PROPRIETARY
• The popularity of open source data systems is about to overtake that of proprietary systems for the first time.
• Why open source?
⎯ Free
⎯ Innovation
⎯ Industry standard
Source: https://db-engines.com/en/ranking_osvsc
AMAZON’ED: CLOUD PROVIDERS THREATENING OSS VENDORS
Redis Labs: AGPL → Redis Source Available License
MongoDB: AGPL → Server Side Public License
Confluent: Apache 2.0 → Confluent Community License
Cockroach Labs: Apache 2.0 → Business Source License
ALL IS NOT LOST
Cloudera will be 100% open source
• Hadoop, Spark, Kafka, ...
Apache Software License 2.0
• Cloudera Manager
• Cloudera Data Science Workbench
• Cloud service
Proprietary → AGPL
CLOUD VENDORS’ OSS STRATEGY
Amazon AWS
EMR, Open Distro for Elasticsearch, DocumentDB, Amazon MSK
Microsoft Azure
Partnership: HDInsight, Azure Databricks, Azure Red Hat OpenShift
Google GCP
Partnership: Confluent, DataStax, Elastic, InfluxData, MongoDB, Neo4j, Redis Labs
CLOUD NATIVE, CLOUD FIRST
KUBERNETES IS THE NEW OPERATING SYSTEM
KubeCon attendance
YuniKorn
FUTURE: HYBRID CLOUD
HYBRID CLOUD IS THE NEW NORM
• Public cloud deployments will capture most of the growth.
• On-prem deployments will still exist, for niche use cases.
⎯ Regulation (FinServ, Healthcare)
⎯ High density (>100 TB per node)
⎯ Specialized hardware (100Gbps NIC, GPU, FPGA, NVMe, Vector Engine)
TAKEAWAY
Big Data → Data Analytics
Commercial open source software market is booming
Don’t bet on a single piece of open source software
Cloud vendors will find a balance with OSS vendors
Hybrid cloud
WE ARE HIRING!
https://www.cloudera.com/careers/teams/engineering.html
(remote positions available)
THANK YOU

  • 1. EMERGING TRENDS IN DATA ANALYTICS Wei-Chiu Chuang, Ph.D. HDFS Lead Engineer, Cloudera | Apache Hadoop PMC/Committer
  • 2. © 2019 Cloudera, Inc. All rights reserved. 2 Hadoop HDFS Lead Engineer Committer/PMC Founding member
  • 3. “BIG DATA” IS PASSÉ
       https://trends.google.com/trends/explore?date=today%205-y&q=Big%20Data,Data%20Analytics
  • 4. LANDSCAPE OF COMMERCIAL OPEN SOURCE DATA ANALYTICS SOFTWARE
  • 5. A YEAR OF TECTONIC SHIFT
       [Logo graphic not preserved; captions: Merged, Acquired by, Acquired by]
  • 6. OPEN SOURCE DATA ANALYTICS SOFTWARE UNICORNS
  • 7. SUB-BILLION, UNICORNS-TO-BE
       [Bar chart: valuation in units of $100 M for Lucidworks, neo4j, H2O.AI; data labels 3.7, 5, 4]
  • 9. IS HADOOP DEAD?
  • 10. MOST ACTIVE VISITS AND DOWNLOADS
       • Hadoop web pages are the most popular among all Apache projects.
       Apache Software Foundation Annual Report 2019
  • 11. MAPREDUCE IS DEAD; LONG LIVE HDFS AND YARN
       • Stack Overflow Trends
       • HDFS and YARN are mature
  • 12. Hadoop is the on-prem platform for Big Data Analytics.
       • Like Linux. Boring, but it’s the foundation.
  • 13. COMPUTE ENGINES IN HADOOP-ECOSYSTEM
       • Stack Overflow Trends
       • Spark most popular
       • Hive stable
       • MapReduce, Pig and Storm are dead.
       https://insights.stackoverflow.com/trends?tags=apache-spark%2Chadoop%2Chive%2Cmapreduce%2Capache-pig%2Capache-storm&fbclid=IwAR1InUJDJPPoUDDfhYtiiaxo21RVqKO2y-SwDQ2_AT4kpIvBn1NjuLYhdGc
  • 14. BIG DATA: YOUR NAME IS SQL
       • Stack Overflow Trends
       • Hive was the most popular until 2018.
       • SparkSQL grew fastest until 2018.
       • Cloud: BigQuery is more popular than Redshift
  • 15. BATCH VS REAL-TIME
  • 16. KAFKA
       • Message broker >> stream processing
  • 17. STREAM PROCESSING
       • Stack Overflow Trends
         ⎯ (exclude Kafka Streams)
       • Flink grows fastest; Beam too
       • Spark Streaming declining
       • Storm is dead
       https://insights.stackoverflow.com/trends?tags=spark-streaming%2Capache-flink%2Capache-storm%2Capache-beam&fbclid=IwAR1InUJDJPPoUDDfhYtiiaxo21RVqKO2y-SwDQ2_AT4kpIvBn1NjuLYhdGc
  • 18. SPARK
       • Stack Overflow Trends
       • Spark is no longer the cool kid
       • You will write in PySpark or SparkSQL.
       • Spark Streaming is declining
       • Very few people develop ML with Spark.
         ⎯ Wait for Spark 3.0?
       • batch >> streaming
       https://insights.stackoverflow.com/trends?tags=apache-spark%2Cpyspark%2Cspark-streaming%2Cspark-dataframe%2Cpyspark-sql%2Capache-spark-sql%2Capache-spark-mllib&fbclid=IwAR1InUJDJPPoUDDfhYtiiaxo21RVqKO2y-SwDQ2_AT4kpIvBn1NjuLYhdGc
  • 19. LANGUAGE
       • Python > Java > C++ > Go
       https://insights.stackoverflow.com/trends?tags=python%2Cgo%2Cjava%2Cc%2B%2B
  • 20. TRENDY TECHNOLOGIES
       • Stack Overflow Trends
       https://insights.stackoverflow.com/trends?tags=tensorflow%2Ckubernetes%2Capache-spark%2Capache-kafka%2Cdocker&fbclid=IwAR1InUJDJPPoUDDfhYtiiaxo21RVqKO2y-SwDQ2_AT4kpIvBn1NjuLYhdGc
  • 21. SUMMARY
       • Deep learning (TensorFlow)
       • Microservices (Kubernetes, Docker, Kafka)
       • Batch > Streaming, but Streaming is gaining traction
       • Python
  • 23. APACHE SOFTWARE FOUNDATION
       • 20th anniversary
       • 300+ projects
       • 48 incubating projects
       • FY2019: 187 k commits, 3215 committers
  • 24. CNCF
       • 6 graduated projects
       • 16 incubating projects
       • 18 sandbox projects
       • Last year: 141 k commits, 7647 committers
  • 25. APACHE VS CNCF
       • Apache: Big Data, Database, Cloud; contributors are individuals
       • CNCF: DevOps tools; contributors are associated with their companies
  • 26. IMPACT OF CLDR-HWX (CLOUDERA-HORTONWORKS) MERGER
       • 63% of Hadoop commits in 2019 were made by Cloudera employees.
       • Community development
       • Bad news: Apache Ambari, Apache Sentry
       • Good news: Hive support for Kudu, Ranger support for Impala
  • 27. DEVELOPER COMMUNITY MOVING TO ASIA
       • HBase
         ⎯ HBaseCon Asia
       • Hadoop
         ⎯ 1st ever Hadoop Meetup in China
       • China is the third largest contributor to CNCF projects.
         ⎯ 3 projects were born in China
       [Chart: Apache GitHub visits]
  • 28. MACHINE LEARNING
       • Apache Hadoop Submarine
       • Submarine Project.
       • Distributed machine learning platform
       • Algorithm development, model batch training, model incremental training, model online services and model management
         ⎯ Available since: 3.2.0 (as part of YARN)
         ⎯ Became a top-level subproject: 0.2.0 (separate release)
         ⎯ Lots of new stuff in 0.3.0.
       https://hadoop.apache.org/submarine/
  • 29. MEASURING THE HEALTH OF DEVELOPER COMMUNITY
  • 30. CREATED/RESOLVED JIRAS
  • 31. NUMBER OF CONTRIBUTORS
       • Apache Spark: https://www.openhub.net/p/apache-spark
       • MongoDB: https://www.openhub.net/p/mongodb
  • 32. DIVERSITY (AFFILIATION) OF KUBERNETES DEVELOPERS
       https://k8s.devstats.cncf.io/d/8/company-statistics-by-repository-group?orgId=1&var-period=m&var-metric=prs&var-repogroup_name=All&var-companies=All
  • 33. APACHE APEX (2016 - 2018)
       The story of an abandoned project
       • “Enterprise-grade unified stream and batch processing engine.”
       • Founded in April 2016.
       • Backed by DataTorrent, which collapsed in May 2018 after raising $23.9 million.
  • 34. APACHE APEX
       • https://reporter.apache.org/wizard/statistics?apex
       • Last new PMC member was added on 2018-05-15.
       • Last new committer was added on 2017-10-19.
       • Community Health Score (Chi): -4.28 (Action required!)
  • 35. WILL CLOUD KILL OPEN SOURCE?
  • 36. OPEN SOURCE VS. PROPRIETARY
       • The popularity of open source data systems is about to overtake that of proprietary systems for the first time.
       • Why open source?
         ⎯ Free
         ⎯ Innovation
         ⎯ Industry standard
       Source: https://db-engines.com/en/ranking_osvsc
  • 37. AMAZON’ED: CLOUD PROVIDERS THREATENING OSS VENDORS
       • Redis Labs: AGPL → Redis Source Available License
       • MongoDB: AGPL → Server Side Public License
       • Confluent: Apache 2.0 → Confluent Community License
       • Cockroach Labs: Apache 2.0 → Business Source License
  • 38. ALL IS NOT LOST
       Cloudera will be 100% open source
       • Hadoop, Spark, Kafka, ...: Apache Software License 2.0
       • Cloudera Manager, Cloudera Data Science Workbench, Cloud service: Proprietary → AGPL
  • 39. CLOUD VENDORS’ OSS STRATEGY
       • Amazon AWS: EMR, Open Distro for Elasticsearch, DocumentDB, Amazon MSK
       • Microsoft Azure (partnership): HDInsight, Azure Databricks, Azure Red Hat OpenShift
       • Google GCP (partnership): Confluent, DataStax, Elastic, InfluxData, MongoDB, Neo4j, Redis Labs
  • 41. KUBERNETES IS THE NEW OPERATING SYSTEM
       [Chart: KubeCon attendance]
  • 44. HYBRID CLOUD IS THE NEW NORM
       • Public cloud deployments will capture most of the growth.
       • On-prem deployments will still exist, for niche use cases.
         ⎯ Regulation (FinServ, Healthcare)
         ⎯ High density (>100 TB per node)
         ⎯ Specialized hardware (100 Gbps NIC, GPU, FPGA, NVMe, Vector Engine)
  • 46. TAKEAWAY
       • Big Data → Data Analytics
       • The commercial open source software market is booming
       • Don’t bet on a single open source software project
       • Cloud vendors will find a balance with OSS vendors
       • Hybrid cloud
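
The commit and committer counts quoted on the Apache Software Foundation and CNCF slides (23 and 24) can be put on a common footing by dividing one by the other. The sketch below is only an illustration: the ratio is not a metric the deck itself reports, the names `foundation_stats` and `commits_per_committer` are made up for this example, and the two foundations may well count "committers" differently, so the result is a rough comparison at best.

```python
# Figures as quoted on the ASF (FY2019) and CNCF ("last year") slides.
# Assumption: "187 k" and "141 k" are taken as 187,000 and 141,000 exactly.
foundation_stats = {
    "ASF": {"commits": 187_000, "committers": 3215},
    "CNCF": {"commits": 141_000, "committers": 7647},
}


def commits_per_committer(stats):
    """Average number of commits per committer for one foundation."""
    return stats["commits"] / stats["committers"]


# Hypothetical derived metric, not taken from the slides.
ratios = {name: commits_per_committer(s) for name, s in foundation_stats.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} commits per committer")
```

On these numbers the ASF figure works out to roughly 58 commits per committer against roughly 18 for CNCF, which is at least consistent with the deck's contrast between the two communities, though the differing counting methods mean it should not be read as more than that.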