Sharon Dashet (Sr. Data Analytics Solution Lead) @ Google Cloud:
The worlds of traditional RDBMS and Data Lake Hadoop systems are converging and moving to public cloud and SaaS offerings.
In this session, Sharon will share her personal journey as a data professional since the 90s, woven into the history of data management systems.
The session will also cover the differences between on-premises and cloud Data Lakes.
Data Lakes on Public Cloud: Breaking Data Management Monoliths
1. Data Lakes in the Public Cloud: Breaking Data Management Monoliths
Sharon Dashet, Sr. Data Analytics Solution Lead, GCP
https://il.linkedin.com/in/sharon-dashet
4. Data Management timeline
● 80s: Relational OLTP
● ~1995: Traditional EDW players
● ~2005: Big data vendors
● ~2010: Cloud platform vendors
● ~2012: Specialized Cloud vendors
Roles:
● Database Developer, Backend Developer, Application DBA, Production DBA
● Data Scientist, Data Analysts, BI/OLAP Expert, SQL Expert, Governance, MDM
● Big Data Developer, Big Data Architect, ML Engineer, Hadoop Admin, Hadoop Expert, AI Scientist, CDO
● Cloud Data Engineer, Cloud Data Architect
5. The Hadoop ecosystem is very popular for Big Data workloads
● HBase (NoSQL datastore)
● Flume (Log aggregation and transport)
● Sqoop (Import and export of relational data)
● Ambari (Management and monitoring)
● MapReduce (Cluster data processing)
● YARN (Cluster resource management)
● HDFS (Hadoop Distributed File System)
● HCatalog (Metadata)
● Oozie (Workflow automation)
● ZooKeeper (Coordination)
● Pig (Scripting)
● Flink (Streams)
● Mahout & Spark ML (Machine learning)
● Presto (Distributed SQL query)
● (Cluster data processing)
● Hive (SQL DW)
6. Typical on-premises deployment: Multi-User, Shared Hadoop Cluster
● Data (HDFS)
● Temp Data (HDFS)
● Metadata (Hive metastore, RDBMS)
● AuthZ Policies, Audit, Governance (Ranger, Atlas)
● AuthN: Kerberos, LDAP
● Compute: YARN (Hive, Spark, MR, R)
● Kafka, Storm, Flume, Cassandra, HBase, ELK etc.
8. On-prem Data Lakes are struggling to deliver value
● TCO Challenges: Resource utilization and overall TCO of on-prem data lakes become unmanageable
● Governance Challenges: Data governance and security issues open up compliance concerns
● Scaling Challenges: Resource-intensive data and analytics processing can lead to missed SLAs
● Agility Challenges: Analytics experimentation is slow due to resource provisioning time
10. Data Lakes are shifting to the cloud
● The need is still there
● AI is now capable of extracting value from unstructured data
● Cloud is faster, simpler to operate, and less expensive
“80 percent of worldwide data will be unstructured by 2025”
IDC
“By connecting data points, we can offer advice like hygiene laws for certain foods, or information on provenance. We can even integrate their local weather forecast so a store doesn't run out of ice cream on a sunny day.”
Sven Lipowski, Unit Owner Customer Solutions at METRONOM
“The ability to spin up purpose driven Hadoop clusters against our shared datasets and scale them up/down with demand is a game changer for us…”
Brett Uyeshiro, VP Platform Services, Pandora
12. Beyond HDFS: Storage and Compute separation
Keep your storage on GCS instead of HDFS.
Benefits:
● Separation of Compute/Storage
● Full HDFS-compliant GCS connector
● Facilitates job-scoped, cost-effective workloads (plus ephemeral clusters)
● No need to provision 3x storage for replication
● No unused bytes on disks
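As a rough illustration of the last two benefits, the sketch below compares the raw disk a cluster would have to provision for a dataset on HDFS with the bytes stored for the same dataset in an object store. The replication factor of 3 is the HDFS default; the `max_fill` headroom value and both function names are assumptions made for this example, not figures from the talk.

```python
import math

TB = 10**12  # decimal terabyte, in bytes


def hdfs_raw_bytes(logical_bytes: int, replication: int = 3,
                   max_fill: float = 0.75) -> int:
    """Raw disk to provision so the replicated data fits under a fill limit.

    replication: copies HDFS keeps of each block (3 is the HDFS default).
    max_fill:    hypothetical fraction of disk an operator is willing to
                 fill; the remainder is headroom, i.e. unused bytes.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0 < max_fill <= 1:
        raise ValueError("max_fill must be in (0, 1]")
    return math.ceil(logical_bytes * replication / max_fill)


def object_store_bytes(logical_bytes: int) -> int:
    """Bytes stored in an object store: only the logical data itself,
    since durability is handled by the service rather than by the user."""
    return logical_bytes


if __name__ == "__main__":
    logical = 100 * TB
    print("HDFS raw TB to provision:", hdfs_raw_bytes(logical) / TB)
    print("Object store TB stored:  ", object_store_bytes(logical) / TB)
```

With these assumed values, every logical terabyte needs four terabytes of provisioned disk on HDFS; changing `max_fill` shows how much of that comes from headroom rather than replication.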
13. Job-Scoped Clusters: Beyond complicated YARN queues
Workloads: Hive Analytics, Business Reporting, MapReduce ETL, Machine Learning
Cloud Dataproc Clusters
Storage: Cloud Storage
Hive Metastore
● Step away from complicated YARN queues and multi-tenancy
● Control cost and performance per workload:
○ Ephemeral clusters
○ Mix regular and preemptible VMs in the worker pool
○ Different VM types
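The per-workload cost control described above can be sketched with some simple arithmetic. The example below is a minimal model, not a pricing tool: the `WorkerPool` class, the function names, and the hourly rates are all hypothetical placeholders in arbitrary cost units, and master-node and storage costs are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerPool:
    """Worker pool of one cluster; rates are placeholder cost units per VM-hour."""
    regular: int
    preemptible: int
    regular_rate: float
    preemptible_rate: float

    def hourly_cost(self) -> float:
        return (self.regular * self.regular_rate
                + self.preemptible * self.preemptible_rate)


def ephemeral_daily_cost(pool: WorkerPool, job_hours_per_day: float) -> float:
    """Cluster exists only while the job runs, then is deleted."""
    return pool.hourly_cost() * job_hours_per_day


def always_on_daily_cost(pool: WorkerPool) -> float:
    """Cluster is kept running around the clock."""
    return pool.hourly_cost() * 24


if __name__ == "__main__":
    # Hypothetical ETL workload: 2 regular + 8 preemptible workers, 2 h/day.
    etl = WorkerPool(regular=2, preemptible=8,
                     regular_rate=1.0, preemptible_rate=0.25)
    # Same worker count, all regular VMs, shared and always on.
    shared = WorkerPool(regular=10, preemptible=0,
                        regular_rate=1.0, preemptible_rate=0.25)
    print("ephemeral, mixed pool:", ephemeral_daily_cost(etl, 2))
    print("always-on, regular   :", always_on_daily_cost(shared))
```

Because each workload gets its own pool, the VM mix and run time can be tuned per job instead of negotiating shares of one long-running cluster.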