Today, data is everywhere. As more data streams into cloud-based systems, the combination of data and computing resources gives us an unprecedented opportunity to perform sophisticated data analysis and to explore advanced machine learning methods such as deep learning.
Clouds pack very large amounts of computing and storage resources, which can be dynamically allocated to create powerful analytical environments. By accessing those analytics clusters, data analysts and data scientists can evaluate more hypotheses and scenarios in parallel, quickly and cost-effectively.
The number of analytical tools supported on the various clouds is increasing by the day. The list spans from traditional RDBMS databases provided by vendors to open source analytics projects such as Hadoop, Hive, Spark and H2O. Besides provisioning tools and solutions on the cloud, many clouds now offer managed services for Data Science, Big Data and Analytics.
Analytics in the cloud provides whole new ways for data analysts, data scientists and business developers to interact with each other, share data and experiments, and develop relevant insights towards improved business processes and results. In this talk, I will describe a number of data analytics solutions for the cloud and how they can be added to your current cloud and on-premises landscape.
2. 2 Natalino Busa - @natbusa
Distributed Computing, Machine Learning, Statistics, Big/Fast Data, Streaming Computing
Head of Applied Data Science at Teradata
On most networks: @natbusa
Analytics in the cloud: stacking layers
Bare Metal: Physical Machines
IaaS: Virtual Resources
CaaS: Containers
dPaaS: Datastores, Data Engines
iPaaS: Tools Integration, Flows & Processes
DAaaS: Data Analytics as a Service (Watson Services, Azure ML, Google Cloud ML, BigML)
Analytics in the cloud: today’s talk
Bare Metal: Physical Machines
IaaS: Virtual Resources
CaaS: Containers
dPaaS: Datastores, Data Engines
iPaaS: Tools Integration, Flows & Processes
DAaaS: Data Analytics as a Service
“We live in an age of open source datacenters, so we can stack all these things together and we have open source from the ground to ceiling.”
Sam Ramji, CEO of Cloud Foundry
https://www.youtube.com/watch?v=7oCSFcUW-Qk
PaaS: Advanced Analytics
AI and Deep Learning
- Unstructured Data
- Object Detection
- Natural Language Processing
- Video Summarization
- Speech Recognition
dPaaS: Machine (deep) Learning
… these are just a few examples ...
Analytics Everywhere
Public Cloud, Managed Cloud, Private Cloud, Private Infra
iPaaS: Components for Analytics in the Cloud
- Object Stores
- SQL: Relational, Transactional DB
- SQL: Big Data, Data Warehousing
- NoSQL
- Machine Learning
- Streaming Computing
iPaaS, dPaaS:
- Object Stores: HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis, S3 (AWS), Storage (GCP), ...
- SQL (Relational, Transactional DB): MySQL, PostgreSQL, MariaDB, Oracle (AWS MP)
- SQL (Big Data, Data Warehousing): Hive, Presto, Spark SQL, Impala, Redshift (AWS), BigQuery (GCP), Big SQL (IBM), Teradata (AWS MP), SAP HANA (AWS MP), Vertica (AWS MP)
- NoSQL: Cassandra, Redis, HBase, Accumulo, Neo4j, Elasticsearch, MongoDB, Couchbase, BigTable (GCP), DynamoDB
- Machine Learning: Spark ML, H2O, Flink, Aerosolve, Theano, TensorFlow, XGBoost, Azure ML, AWS ML, Google ML, IBM Watson
- Streaming Computing: Heron (Storm), NiFi, Spark Streaming, Flink, Kafka Streams, Logstash, StreamSQL, Google DataFlow (GCP)
iPaaS: Selecting your Analytical Stack
Flexible. Powerful.
- Combinations for this example:
8 * 3 * 4 * 8 * 7 * 7 = 37632
Right tool for the right job
- Fit for purpose
- Multi-Genre Analytics
Hard to maintain and upgrade:
- Extended skills and know-how
- Component upgrades must be compatible
Hard to configure:
- no matter if cloud, bare metal or VMs
- complex stacks with many tools and services
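The combination count on this slide is a plain product of the number of options per component category. A minimal sketch of that arithmetic follows. The pairing of each factor with a category name is an assumption based on the column order of the component catalog, and the two-category mini catalog at the end is hypothetical, included only to show that a stack is one pick per category.

```python
from itertools import product
from math import prod

# Factors copied from the slide: 8 * 3 * 4 * 8 * 7 * 7.
# Assumption: each factor is the number of options in one category,
# listed here in the column order of the component catalog.
options_per_category = {
    "Object Stores": 8,
    "SQL: Relational, Transactional DB": 3,
    "SQL: Big Data, Data Warehousing": 4,
    "NoSQL": 8,
    "Machine Learning": 7,
    "Streaming Computing": 7,
}

# Independent choices multiply, so the total is the product of the counts.
total_stacks = prod(options_per_category.values())
print(total_stacks)  # 37632

# Hypothetical mini catalog: a stack is one pick from every category.
mini_catalog = {
    "NoSQL": ["Cassandra", "HBase"],
    "Streaming Computing": ["Flink", "Spark Streaming", "Kafka Streams"],
}
mini_stacks = list(product(*mini_catalog.values()))
print(len(mini_stacks))  # 6
```

Adding a single option to any one category multiplies the total again, which is why the flexibility of a hand-picked stack comes with the configuration and upgrade burden listed above.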
iPaaS: Deploy & Manage your own Analytics
How to simplify? Select a bundle!
iPaaS: bundled recipes & stacks
Select a recipe:
- Hortonworks Data Platform
- Cloudera Data Platform
- Reactive Platform
- SMACK Stack
- PANCAKE Stack
- ELK Stack
- Select your own
iPaaS: my favorite analytical stacks
Stack (components) | Object Stores | NoSQL | SQL: Big Data, Data Warehousing | Machine Learning | Streaming Computing
All Hadoop (5) | HDFS | HBase | Hive | Spark | Storm
SMACK stack (2) | Cassandra | Cassandra | Spark | Spark | Spark
Elastic (5) | HDFS | Elasticsearch | Hive | H2O | Kafka
Data Science (8) | HDFS | Elasticsearch | Hive, Presto | Spark, H2O, TensorFlow | Flink
Real Time (2) | Cassandra | Cassandra | Flink | Flink | Flink
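The number in parentheses next to each stack name appears to be the count of distinct components in that stack. A small sketch that checks this reading is given here. The rows are transcribed from the table above, and the interpretation of the number is an assumption, not something the slide states.

```python
# Components per stack, read left to right across the table; a repeated
# name means the same component fills several roles in that stack.
stacks = {
    "All Hadoop": ["HDFS", "HBase", "Hive", "Spark", "Storm"],
    "SMACK stack": ["Cassandra", "Cassandra", "Spark", "Spark", "Spark"],
    "Elastic": ["HDFS", "Elasticsearch", "Hive", "H2O", "Kafka"],
    "Data Science": ["HDFS", "Elasticsearch", "Hive", "Presto",
                     "Spark", "H2O", "TensorFlow", "Flink"],
    "Real Time": ["Cassandra", "Cassandra", "Flink", "Flink", "Flink"],
}

# Collapsing repeats with a set gives the number of systems to operate.
distinct_components = {name: len(set(parts)) for name, parts in stacks.items()}

for name, count in distinct_components.items():
    print(f"{name}: {count}")
```

Under this reading, the SMACK and Real Time stacks cover all five roles with only two systems, while the Data Science stack trades that simplicity for eight specialised tools.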
dPaaS: Managed Analytics
This is hard! Can we access it as a service?
dPaaS: Managed Hadoop & Spark
HDInsight: Hadoop, Spark, and R as services
Managed Spark Clusters, BigInsights (Hadoop)
DataFlow and DataProc: Flink, Spark and Hadoop Clusters as a Service
EMR: Hadoop components à la carte
PaaS: Analytical clusters
Ephemeral:
- Create then Dispose
- Clusters are Short-Lived
- Data Exploration
- Isolated, Personal
- Simple Access Management
- Interactive Analytics
vs
Permanent:
- Clusters are Long-Lived
- Scheduled Operations
- Production ETL
- Coordinated
- Complex Access Management
- Batch Analytics
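The "create then dispose" lifecycle of an ephemeral cluster maps naturally onto a scoped resource. The sketch below illustrates the pattern with a Python context manager. `InMemoryProvisioner` and `ephemeral_cluster` are hypothetical names invented for this example; a real setup would call a cloud provider's API in their place.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class InMemoryProvisioner:
    """Hypothetical stand-in for a cloud API; it only tracks cluster names."""
    running: set = field(default_factory=set)

    def create(self, name: str) -> str:
        self.running.add(name)
        return name

    def dispose(self, name: str) -> None:
        self.running.discard(name)


@contextmanager
def ephemeral_cluster(provisioner, name):
    """Create a cluster for one session and dispose of it afterwards."""
    cluster = provisioner.create(name)
    try:
        yield cluster
    finally:
        # Runs even if the exploration code raises, so nothing is left behind.
        provisioner.dispose(cluster)


provisioner = InMemoryProvisioner()
with ephemeral_cluster(provisioner, "exploration-1") as cluster:
    # Interactive, personal analysis would happen here.
    active_during_session = cluster in provisioner.running
active_after_session = "exploration-1" in provisioner.running

print(active_during_session, active_after_session)
```

A permanent cluster has no such scope: it outlives any single session, which is where scheduled operations and more complex access management come in.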
DAaaS: Microsoft’s Cortana and ML Studio
DAaaS: Google ML and AI as a service
Cloud Computing for
Deep Neural Networks
> Train, Score, Data
AI and ML models for:
● Speech (audio)
● Language (text)
● Vision (images/video)
Summary
• Analytics in the Cloud:
The dawn of a new computing era
• iPaaS, dPaaS:
complexity vs flexibility, it’s a tradeoff
• Computing clusters:
Ephemeral and Persistent
Head of Applied Data Science at Teradata
Distributed Computing, Machine Learning, Statistics, Big/Fast Data, Streaming Computing
LinkedIn and Twitter: natbusa