Data Lakes on Public Cloud: Breaking Data Management Monoliths

Sharon Dashet (Sr. Data Analytics Solution Lead) @ Google Cloud:
The worlds of traditional RDBMS and Data Lake Hadoop systems are converging and moving to public cloud and SaaS offerings.
In this session, Sharon will share her personal journey as a data professional since the '90s, woven into the history of data management systems.
The session will also cover the differences between on-premises and cloud Data Lakes.

  1. Data Lakes in the Public Cloud: Breaking Data Management Monoliths. Sharon Dashet, Sr. Data Analytics Solution Lead, GCP
  2. 01 Intro to Data Lakes
  3. It all started with RDBMS…
  4. Data management timeline
     ● Eras: relational OLTP (80s); traditional EDW players (~1995); big data vendors (~2005); cloud platform vendors (~2010); specialized cloud vendors (~2012)
     ● Roles along the timeline: Database Developer, Backend Developer, Application DBA, Production DBA, Data Scientist, Data Analysts, BI/OLAP Expert, SQL Expert, Governance, MDM, Big Data Developer, Big Data Architect, ML Engineer, Hadoop Admin, Hadoop Expert, AI Scientist, CDO, Cloud Data Engineer, Cloud Data Architect
  5. The Hadoop ecosystem is very popular for Big Data workloads
     ● HBase (NoSQL datastore)
     ● Flume (log aggregation and transport)
     ● Sqoop (import and export of relational data)
     ● Ambari (management and monitoring)
     ● MapReduce (cluster data processing)
     ● YARN (cluster resource management)
     ● HDFS (Hadoop Distributed File System)
     ● HCatalog (metadata)
     ● Oozie (workflow automation)
     ● ZooKeeper (coordination)
     ● Pig (scripting)
     ● Flink (streams)
     ● Mahout & Spark ML (machine learning)
     ● Presto (distributed SQL query)
     ● (Cluster data processing)
     ● Hive (SQL DW)
  6. Typical on-premises deployment: multi-user, shared Hadoop cluster
     ● Data (HDFS)
     ● Temp data (HDFS)
     ● Metadata (Hive metastore, RDBMS)
     ● AuthZ policies, audit, governance (Ranger, Atlas)
     ● Compute: YARN, Hive, Spark, MR, R
     ● AuthN: Kerberos, LDAP
     ● Kafka, Storm, Flume, Cassandra, HBase, ELK, etc.
  7. The Apache data-processing ecosystem
  8. On-prem Data Lakes are struggling to deliver value
     ● TCO challenges: resource utilization and overall TCO of on-prem data lakes become unmanageable
     ● Governance challenges: data governance and security issues open up compliance concerns
     ● Scaling challenges: resource-intensive data and analytics processing can lead to missed SLAs
     ● Agility challenges: analytics experimentation is slow due to resource provisioning time
  9. Key market players are struggling to convert customers.
  10. The need is still there: Data Lakes are shifting to the cloud
     ● AI is now capable of extracting value from unstructured data
     ● Cloud is faster, simpler to operate, and less expensive
     ● “80 percent of worldwide data will be unstructured by 2025” IDC (source)
     ● “By connecting data points, we can offer advice like hygiene laws for certain foods, or information on provenance. We can even integrate their local weather forecast so a store doesn't run out of ice cream on a sunny day.” Sven Lipowski, Unit Owner Customer Solutions at METRONOM
     ● “The ability to spin up purpose driven Hadoop clusters against our shared datasets and scale them up/down with demand is a game changer for us…” Brett Uyeshiro, VP Platform Services, Pandora
  11. 02 Patterns for Data Lakes in Public Cloud
  12. Beyond HDFS: storage and compute separation. Keep your storage on GCS instead of HDFS. Benefits:
     ● Separation of compute and storage
     ● Full HDFS-compliant GCS connector
     ● Facilitates job-scoped, cost-effective workloads (plus ephemeral clusters)
     ● No need to provision 3x storage for replication
     ● No unused bytes on disks
  13. Job-scoped clusters: beyond complicated YARN queues
     ● Step away from complicated YARN queues and multi-tenancy
     ● Control cost and performance per workload:
        ○ Ephemeral clusters
        ○ Mix regular and preemptible VMs in the worker pool
        ○ Different VM types
     ● Diagram labels: Hive Analytics, Business Reporting, MapReduce ETL, Machine Learning, Storage, Cloud Storage, Hive Metastore, Cloud Dataproc Clusters
  14. Beyond YARN and into Modern Service Mesh
  15. Converged Smart Analytics: (1) Data sources, (2) Data Lake storage, (3) Data Pipelines, (4) Data Warehouse/Lake, (5) ML and analytics workloads. Diagram labels: AI Platform, AI Platform Notebooks
  16. Thank you
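
A rough sketch of the arithmetic behind slides 12 and 13, not taken from the deck itself. It assumes the HDFS default replication factor of 3; the headroom fraction, the bucket name, the host name, and every function name here are hypothetical and only illustrate the idea of dropping replicated HDFS capacity, pointing jobs at gs:// paths, and splitting a job-scoped worker pool between regular and preemptible VMs.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

HDFS_DEFAULT_REPLICATION = 3  # HDFS default; clusters may override it


def hdfs_raw_capacity_tb(logical_tb: float,
                         replication: int = HDFS_DEFAULT_REPLICATION,
                         headroom: float = 0.25) -> float:
    """Raw disk to provision on HDFS for a logical dataset size.

    headroom is an assumed fraction of disk kept free for temp data
    and growth (the "unused bytes on disks" from slide 12).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return logical_tb * replication / (1 - headroom)


def to_gcs_uri(hdfs_uri: str, bucket: str) -> str:
    """Map an hdfs:// URI to a gs:// URI, keeping the path unchanged."""
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    return f"gs://{bucket}{parsed.path}"


@dataclass(frozen=True)
class WorkerPool:
    regular: int
    preemptible: int


def plan_worker_pool(total_workers: int, preemptible_share: float) -> WorkerPool:
    """Split a job-scoped cluster's workers into regular and preemptible VMs."""
    if not 0 <= preemptible_share <= 1:
        raise ValueError("preemptible_share must be in [0, 1]")
    preemptible = int(total_workers * preemptible_share)
    return WorkerPool(regular=total_workers - preemptible, preemptible=preemptible)


if __name__ == "__main__":
    # 100 TB of logical data, 3 replicas, 25% of disk kept free
    print(hdfs_raw_capacity_tb(100))
    print(to_gcs_uri("hdfs://namenode:8020/warehouse/sales", "example-bucket"))
    print(plan_worker_pool(10, 0.5))
```

With object storage, the provisioned figure in the first function is no longer something the cluster owner sizes up front, which is what makes short-lived, per-job clusters practical.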